We explore the construction of nonsubjective prior distributions
in Bayesian statistics via a posterior predictive relative entropy regret
criterion. We carry out a minimax analysis based on a derived
asymptotic predictive loss function and show that this approach to
prior construction has a number of attractive features. The approach
here differs from previous work that uses either prior or posterior
relative entropy regret in that we consider predictive performance in
relation to alternative nondegenerate prior distributions. The theory
is illustrated with an analysis of some specific examples.



1. Introduction. There is an extensive literature on the development of
objective prior distributions based on information loss criteria. Bernardo [5]
obtains reference priors by maximizing the Shannon mutual information
between the parameter and the sample. These priors are maximin solutions
under relative entropy loss; see, for example, [3, 8] for further analysis,
discussion and references. In regular parametric families the reference prior for
the full parameter is Jeffreys' prior. It is argued in [5], however, that when
nuisance parameters are present, then the appropriate reference prior should
depend on which parameter(s) are deemed to be of primary interest. This
dependence on parameters of interest is mirrored in the approach to prior
development via minimization of coverage probability bias; see, for example,
[11, 23, 25] for further aspects of this approach.

In the present paper we explore the construction of nonsubjective prior
distributions via predictive performance. It is possible to use Bernardo's
approach to obtain reference priors for prediction. However, as shown in [5],
this program turns out to be equivalent to obtaining the reference prior for
the full parameter, which produces Jeffreys' prior in regular problems.
Further analysis along these lines is carried out in [17]. Datta et al. [12] explore
prior construction using predictive probability matching, which is shown 
to produce sensible prior distributions in a number of standard examples. 
In the present article we follow Bernardo [5] and Barron [3] by taking an 
information-theoretic approach and using an entropy-based risk function. 
However, here we focus on the posterior predictive relative entropy regret, 
as opposed to the prior predictive relative entropy regret used by these au- 
thors. Our starting point is the predictive information criterion introduced 
by Aitchison [1], which was also discussed by Akaike [2] as a criterion for 
the selection of objective priors. We depart from these and other authors 
by taking a more Bayesian viewpoint, in that we are less concerned here 
with performance in repeated sampling but rather with performance in re- 
lation to alternative prior specifications. The main aim of the paper is to 
search for uniform, or impartial, minimax priors under an associated predic- 
tive loss function. These priors are also maximin, or least favorable, which 
can be interpreted here as giving rise to minimum information predictive 
distributions. 

The organization of the paper is as follows. We start in Section 2 by defin- 
ing the posterior predictive regret, which measures the regret when using a 
posterior predictive distribution under a particular prior in relation to the 
posterior predictive distribution under an alternative proper prior. We define 
a related predictive loss function and argue that this is a suitable criterion 
for the comparison of alternative prior specifications. We discuss informally 
the results in Section 6 on impartial, minimax and maximin priors under a 
large sample version of this loss function. We also give a definition of the pre- 
dictive information in a prior distribution. Throughout we make connections 
with standard quantities that arise in information theory. In Section 3 we 
relate posterior predictive regret and loss to prior predictive regret and loss 
and in Section 4 we obtain the asymptotic behavior of the posterior predic- 
tive regret, which is obtained via an analysis of the higher-order asymptotic 
behavior of the prior predictive regret. The higher-order analysis carried out 
in Section 5, which is of independent interest, leads to expressions for the 
asymptotic forms of the posterior predictive regret, predictive information 
and predictive loss. In Section 6 we investigate impartial minimax priors 
under our asymptotic predictive loss function. It turns out that these priors 
also minimize the asymptotic information in the predictive distribution. In 
the case of a single real parameter, Jeffreys' prior turns out to be minimax. 
However, in dimensions greater than one, the minimax solution need not be 
Jeffreys' prior. The theory is illustrated with an analysis of some specific 
examples, and some concluding remarks are given in Section 7. 

There are a number of appealing aspects of the proposed Bayesian predictive
approach to prior determination. First, since the focus is on prediction,
there is no need to specify a set of parameters deemed to be of interest. Sec- 
ond, difficulties associated with improper priors are avoided in the formula- 
tion of posterior predictive, as opposed to prior predictive, criteria. Third, 
the minimax priors identified in Section 6 arise as limits of proper priors. 
Fourth, these minimax priors are also maximin, or least favorable for predic- 
tion, which can be interpreted here as minimizing the predictive information 
contained in a prior. Finally, and importantly, the same asymptotic predic- 
tive loss criterion emerges regardless of whether one is considering prediction 
of a single future observation or a large number of future observations. 

2. Posterior predictive regret and impartial priors. Consider a parametric
model with density $p(\cdot|\theta)$ with respect to a $\sigma$-finite measure $\mu$, where
$\theta = (\theta^1,\ldots,\theta^p)$ is an unknown parameter in an open set
$\Theta \subset \mathbb{R}^p$, $p \ge 1$. Let
$p^\pi(x) = \int p(x|\theta)\,d\pi(\theta)$ be the marginal density of $X$ under the prior
distribution $\pi$ on $\Theta$, where both $\pi$ and $p^\pi$ may be improper. Let $\Pi$ be the class
of prior distributions $\pi$ satisfying $p^\pi(X) < \infty$ a.s. $(\theta)$ for all $\theta \in \Theta$. That is,
$\pi \in \Pi$ if and only if $P^\theta(\{X : p^\pi(X) < \infty\}) = 1$ for all $\theta \in \Theta$.

We suppose that $X$ represents data to be observed and $Y$ represents future
observations to be predicted. Denote by $p^\pi(y|x)$ the posterior predictive
density of $Y$ given $X = x$ under the prior $\pi \in \Pi$. Let $\Omega \subset \Pi$ be the class of
all proper prior distributions on $\Theta$. For $\pi \in \Pi$ and $\tau \in \Omega$, define the posterior
predictive regret

$$d_{Y|X}(\tau,\pi) = \int\!\!\int \log\frac{p^\tau(y|x)}{p^\pi(y|x)}\,p^\tau(x,y)\,d\mu(x)\,d\mu(y). \tag{2.1}$$

We note that $d_{Y|X}(\tau,\pi)$ is the conditional relative entropy, or expected
Kullback-Leibler divergence, $D(p^\tau(Y|X)\,\|\,p^\pi(Y|X))$, between the predictive
densities under $\tau$ and $\pi$. See, for example, the book by Cover and Thomas
[10] for definitions and properties of the various information-theoretic
quantities that arise in this work. It follows from standard results in information
theory that the quantity $d_{Y|X}(\tau,\pi)$ always exists (possibly $+\infty$) and is
nonnegative. It is zero when $\pi = \tau$ and is therefore the expected regret under
the loss function $-\log p^\pi(y|x)$ associated with using the predictive density
$p^\pi(y|x)$ when $X$ and $Y$ arise from $p^\tau(x)$ and $p^\tau(y|x)$, respectively.

When $\tau = \{\theta\}$, the distribution degenerate at $\theta \in \Theta$, we will simply write
$d_{Y|X}(\tau,\pi) = d_{Y|X}(\theta,\pi)$, where

$$d_{Y|X}(\theta,\pi) = \int\!\!\int \log\frac{p(y|x,\theta)}{p^\pi(y|x)}\,p(x,y|\theta)\,d\mu(x)\,d\mu(y) \tag{2.2}$$

is the expected regret under the loss function $-\log p^\pi(y|x)$ associated with
using the predictive density $p^\pi(y|x)$ when $X$ and $Y$ arise from $p(x|\theta)$ and
$p(y|x,\theta)$, respectively. The regret (2.2) is the conditional relative entropy
$D(p(Y|X,\theta)\,\|\,p^\pi(Y|X))$. The readily derived relationship

$$\int d_{Y|X}(\theta,\pi)\,d\tau(\theta) = d_{Y|X}(\tau,\pi) + \int d_{Y|X}(\theta,\tau)\,d\tau(\theta) \tag{2.3}$$

implies that (2.2) is a proper scoring rule, as pointed out by Aitchison [1];
that is, the left-hand side of (2.3) attains its minimum value over $\pi \in \Pi$
when $\pi = \tau$. We note that the final integral in (2.3) is the Shannon
conditional mutual information $I(Y;\theta|X)$ between $Y$ and $\theta$ conditional on $X$
(under the prior $\tau$). Conditional mutual information has been used by Sun
and Berger [21] for deriving reference priors conditional on a parameter to
which a subjective prior has been assigned, and by Clarke and Yuan [9] for
deriving possibly data-dependent "partial information" reference priors that
are conditional on a statistic.
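
As a minimal numerical sketch of definitions (2.1) and (2.2), consider a toy setting that is not taken from the paper: $X$ and $Y$ are single independent Bernoulli($\theta$) observations and $\theta$ ranges over a hypothetical three-point set, so that every integral reduces to a finite sum. The code evaluates both regrets and checks that the averaged regret $\int d_{Y|X}(\theta,\pi)\,d\tau(\theta)$ splits into $d_{Y|X}(\tau,\pi)$ plus the conditional mutual information $\int d_{Y|X}(\theta,\tau)\,d\tau(\theta)$, which is the form of relation (2.3) assumed here. The names `THETAS`, `regret_at_theta` and `regret` are illustrative choices.

```python
import math
from itertools import product

THETAS = [0.2, 0.5, 0.8]  # hypothetical finite parameter set (an assumption)

def joint(theta, x, y):
    """p(x, y | theta) for independent Bernoulli(theta) values x, y in {0, 1}."""
    px = theta if x else 1 - theta
    py = theta if y else 1 - theta
    return px * py

def marginal_xy(prior, x, y):
    """p^prior(x, y): the joint probability averaged over the discrete prior."""
    return sum(w * joint(t, x, y) for t, w in zip(THETAS, prior))

def predictive(prior, y, x):
    """Posterior predictive probability p^prior(y | x)."""
    return marginal_xy(prior, x, y) / sum(marginal_xy(prior, x, v) for v in (0, 1))

def regret_at_theta(theta, prior):
    """d_{Y|X}(theta, pi) as in (2.2); here p(y | x, theta) = p(y | theta)."""
    total = 0.0
    for x, y in product((0, 1), repeat=2):
        p_y = theta if y else 1 - theta
        total += joint(theta, x, y) * math.log(p_y / predictive(prior, y, x))
    return total

def regret(tau, pi):
    """d_{Y|X}(tau, pi) as in (2.1)."""
    total = 0.0
    for x, y in product((0, 1), repeat=2):
        ratio = predictive(tau, y, x) / predictive(pi, y, x)
        total += marginal_xy(tau, x, y) * math.log(ratio)
    return total

tau = [0.5, 0.3, 0.2]   # hypothetical proper prior playing the role of tau
pi = [1 / 3, 1 / 3, 1 / 3]  # hypothetical prior playing the role of pi

averaged_regret = sum(w * regret_at_theta(t, pi) for t, w in zip(THETAS, tau))
mutual_information = sum(w * regret_at_theta(t, tau) for t, w in zip(THETAS, tau))
print(averaged_regret, regret(tau, pi) + mutual_information)
```

Because the mutual information term does not involve $\pi$, minimizing the averaged regret over $\pi$ is the same as minimizing $d_{Y|X}(\tau,\pi)$, which is zero at $\pi = \tau$.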

Definition (2.1) of the posterior predictive regret is motivated by standard 
arguments for adopting the logarithmic score log q(Y) as an operational util- 
ity function when using q as a predictive density for the random quantity 
Y; see, for example, the discussion in Chapter 2 of [6]. The criterion (2.2) 
was used by Aitchison [1] for the purpose of comparing the predictive per- 
formance of estimative and posterior predictive distributions, which was fol- 
lowed up by Komaki [16], who considered the associated asymptotic theory 
for curved exponential families. Hartigan [14] obtained related higher-order 
asymptotic expressions which he used to compare estimative predictive dis- 
tributions based on (bias-corrected) maximum likelihood and Bayes esti- 
mators. Akaike [2] discussed the use of (2.2) for the selection of objective 
priors. A similar approach was also proposed by Geisser in his discussion of 
Bernardo [5]. Recently, Liang and Barron [19] have derived exact minimax 
priors under the criterion (2.2) for location and scale families. 

The criterion (2.1) extends the domain of definition of (2.2) from degenerate
priors $\{\theta\}$ to all proper priors $\tau \in \Omega$. We argue that (2.1) is a suitable
Bayesian performance characteristic for assessing the predictive performance
of a nonsubjective prior distribution $\pi$ when $\theta$ arises from alternative proper
prior distributions $\tau$. There are two ways of thinking about this. First, we
might be interested in the predictive performance of a proposed nonsubjective
prior distribution under its repeated use, as opposed to its performance
under repeated sampling, as measured by (2.2). From this point of view, we
could consider the prior selection problem as an idealized game between the
Statistician and Nature, in which each player selects a prior distribution.
An alternative viewpoint is to consider (2.1) as measuring the predictive
performance of $\pi$ in relation to a subjective prior distribution $\tau$ that is as
yet unspecified. Thus, $\tau$ might reflect the prior beliefs, yet to be elicited,
of an expert. In this case the prior selection problem could be viewed as a
game between the Statistician and an Expert. It is possible, of course, that
the Statistician and Expert are the same person, whose prior beliefs have
yet to be properly formulated.

Akaike [2] considered priors that give constant posterior predictive regret
(2.2), referring to such priors as uniform or "impartial" priors. Such priors
will only exist in special cases, however. Achieving constant regret over all
possible priors $\tau \in \Omega$ in (2.1) is clearly never possible since, for any fixed
$\pi \in \Pi$, the precision of the predictive distribution under $\tau$ will tend to increase
as $\tau$ becomes more informative, in which case $d_{Y|X}(\tau,\pi)$ will eventually
increase. Alternatively, since $\tau$ is unknown, one might wish to consider the
minimaxity of $\pi$ over all $\tau \in \Omega$. However, the maximum regret will tend to
occur at degenerate $\tau$. We would therefore be led back to the frequentist
risk criterion (2.2), which is not the object of primary interest in the present
paper.

For these reasons, we will study the loss function

$$L_{Y|X}(\tau,\pi;\pi^B) = d_{Y|X}(\tau,\pi) - d_{Y|X}(\tau,\pi^B), \tag{2.4}$$

provided that this exists (see later), which is the posterior predictive regret
associated with using the prior $\pi$ compared to using a fixed base prior
$\pi^B \in \Pi$. Since we will be investigating default priors for prediction, it is necessary
that our procedure for choosing the base measure $\pi^B$ is such that the predictive
density $p^B(y|x)$ under $\pi^B$ does not depend on the particular parameterization
of the model that is adopted. We are therefore inevitably led to a choice of base
measure that is invariant under arbitrary reparameterization. In the case of a regular
parametric family, an obvious candidate for $\pi^B$ is Jeffreys' invariant prior $\pi^J$
with density proportional to $|I(\theta)|^{1/2}$, where $I(\theta)$ is Fisher's information
in the sample $X$. Since we will only be considering regular likelihoods in
the rest of this paper, we take $\pi^B = \pi^J$ in the sequel and simply write
$L_{Y|X}(\tau,\pi;\pi^J) = L_{Y|X}(\tau,\pi)$.

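Assume that the base Jeffreys' prior $\pi^J$ satisfies $d_{Y|X}(\theta,\pi^J) < \infty$ for all
$\theta \in \Theta$ and let $p^J(y|x)$ be the conditional density of $Y$ given $X$ under $\pi^J$.
Then the (posterior) predictive loss function defined by

$$L_{Y|X}(\theta,\pi) = d_{Y|X}(\theta,\pi) - d_{Y|X}(\theta,\pi^J)
 = \int\!\!\int \log\frac{p^J(y|x)}{p^\pi(y|x)}\,p(x,y|\theta)\,d\mu(x)\,d\mu(y) \tag{2.5}$$

is well defined, although possibly $+\infty$. Now let $\Omega_{Y|X} \subset \Omega$ be the class of
proper priors $\tau$ for which $\int d_{Y|X}(\theta,\pi^J)\,d\tau(\theta) < \infty$. Then for $\pi \in \Pi$ and
$\tau \in \Omega_{Y|X}$, we can define the expected predictive loss

$$\begin{aligned}
L_{Y|X}(\tau,\pi) &= \int L_{Y|X}(\theta,\pi)\,d\tau(\theta) \\
 &= \int d_{Y|X}(\theta,\pi)\,d\tau(\theta) - \int d_{Y|X}(\theta,\pi^J)\,d\tau(\theta) \\
 &= d_{Y|X}(\tau,\pi) - d_{Y|X}(\tau,\pi^J),
\end{aligned} \tag{2.6}$$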
as in (2.4). Since $\tau \in \Omega_{Y|X}$, the final line is well defined (possibly $+\infty$).
Next we define, for $\tau \in \Omega$,

$$\zeta_{Y|X}(\tau) = d_{Y|X}(\tau,\pi^J) = \int\!\!\int \log\frac{p^\tau(y|x)}{p^J(y|x)}\,p^\tau(x,y)\,d\mu(x)\,d\mu(y). \tag{2.7}$$

Since the negative conditional relative entropy
$-d_{Y|X}(\tau,\pi^J) = -D(p^\tau(Y|X)\,\|\,p^J(Y|X))$ is a natural information-theoretic
measure of the uncertainty in the predictive distribution $p^\tau(Y|X)$, we will refer
to $\zeta_{Y|X}(\tau)$ as the predictive information in $\tau$. Here $p^J(y|x)$ acts as a
normalization of the conditional entropy of $p^\tau(y|x)$. From relation (2.3) with
$\pi = \pi^J$, we see that $\zeta_{Y|X}(\tau) \le \int d_{Y|X}(\theta,\pi^J)\,d\tau(\theta)$, from which it follows
that $\sup_{\tau\in\Omega}\zeta_{Y|X}(\tau) = \sup_{\theta\in\Theta}\zeta_{Y|X}(\{\theta\})$. That is, the maximum
predictive information occurs at (or near) a degenerate prior. Thus, $\zeta_{Y|X}(\tau)$ is a
natural entropy-based measure of the information in the predictive distribution
$p^\tau(y|x)$. Note that, again from (2.3), $\zeta_{Y|X}(\tau) < \infty$ whenever $\tau \in \Omega_{Y|X}$.

It now follows from (2.3), (2.6) and (2.7) that, for $\pi \in \Pi$ and $\tau \in \Omega_{Y|X}$,
we can write

$$d_{Y|X}(\tau,\pi) = L_{Y|X}(\tau,\pi) + \zeta_{Y|X}(\tau). \tag{2.8}$$

We will explore priors for which $L_{Y|X}(\theta,\pi)$ is approximately constant in
$\theta \in \Theta$. Notice that if $L_{Y|X}(\theta,\pi)$ is approximately constant, then, from (2.8),
$d_{Y|X}(\tau,\pi)$ is approximately constant over all $\tau$ having the same predictive
information $\zeta_{Y|X}(\tau)$. This therefore provides a suitable notion of approximate
uniformity of the posterior predictive regret (2.1).

In Sections 4 and 5 we will derive large sample forms, $L(\theta,\pi)$, $L(\tau,\pi)$, $\zeta(\tau)$
and $d(\tau,\pi)$, respectively, of suitably normalized versions of $L_{Y|X}(\theta,\pi)$,
$L_{Y|X}(\tau,\pi)$, $\zeta_{Y|X}(\tau)$ and $d_{Y|X}(\tau,\pi)$ and simply refer to $L(\theta,\pi)$ as the
predictive loss function. Importantly, for smooth priors $\pi$ this asymptotic loss
function will not depend on the amount of prediction $Y$ to be carried out.
In Section 6 we will investigate uniform and minimax priors under predictive
loss. As is often the case in game theory, there is a strong relationship
between constant loss, minimax and maximin priors. We give an informal
statement of Theorem 6.1. An equalizer prior is a prior $\pi$ for which the
predictive loss function $L(\theta,\pi)$ is constant over $\theta \in \Theta$. Suppose that $\pi_0$ is an
equalizer prior and that there exists a sequence $\tau_k$ of proper priors in the
class $\Phi \subset \Omega$, to be defined in Section 4, for which $d(\tau_k,\pi_0) \to 0$ as $k \to \infty$.
Then Theorem 6.1 states that $\pi_0$ is minimax with respect to $L(\tau,\pi)$ and
$\zeta(\pi_0) = \inf_{\tau\in\Phi}\zeta(\tau)$; that is, $\pi_0$ contains minimum predictive information
about $Y$. This latter property is equivalent to $\pi_0$ being maximin, or least
favorable, under $L(\tau,\pi)$. Since by construction $L(\tau,\pi^J) = 0$ for all $\tau \in \Phi$, $\pi^J$
is automatically an equalizer prior. However, there may not exist a sequence
$\tau_k$ of proper priors with $d(\tau_k,\pi^J) \to 0$, in which case Jeffreys' prior may not
be minimax. Some examples will be given in Section 6.

Although the focus of this paper is on the general asymptotic form of
the predictive loss, we briefly note the implications of adopting either the
posterior predictive regret (2.2) or the predictive loss (2.5) in the special
case where the family $p(\cdot|\theta)$ of densities is invariant under a suitable group
$\mathcal{G}$ of transformations of the sample space. See, for example, Chapter 6 in
[4] for a general discussion of invariant decision problems. Let $\bar{\mathcal{G}}$ be the
induced group of transformations on $\Theta$. Then the predictive loss (2.5) is
invariant under $\bar{\mathcal{G}}$ and the invariant decisions are invariant priors satisfying
$\pi(\bar g(\theta)) \propto \pi(\theta)\,|d\theta/d\bar g(\theta)|$ for all $\bar g \in \bar{\mathcal{G}}$. If the group $\bar{\mathcal{G}}$ is transitive, then
the predictive loss is constant for every invariant prior. Furthermore, if we
consider the broader decision problem in which we replace $p^\pi(\cdot|x)$ by the
arbitrary decision function $\delta(x) = q_x$, where $q_x(\cdot)$ is to be used as a predictive
density for $Y$ when $X = x$, then it can be shown that $p^H(y|x)$, the posterior
predictive density under the right Haar measure on $\bar{\mathcal{G}}$, is the best invariant
predictive density under the posterior predictive regret (2.2). Since $\pi^J$ is an
invariant prior, it further follows that the right Haar measure is the best
invariant prior under the predictive loss function (2.5). Since submission
of the final version of the present paper, a careful analysis using (2.2) for
location and scale families has appeared in [19].
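
The constancy of the regret under a transitive group can be illustrated numerically. The sketch below assumes a normal location model, with $X_1,\ldots,X_n$ and a single future $Y$ all i.i.d. $N(\theta,1)$; this model and the hypothetical $N(0,1/a)$ prior are our choices for illustration and are not examples from the text. Under the flat prior (the right Haar measure for translations, which is also Jeffreys' prior here) the predictive density is $N(\bar x, 1 + 1/n)$, and the regret (2.2) is available in closed form as an expected Kullback-Leibler divergence between two normal densities.

```python
import math

def regret(theta, n, a):
    """d_{Y|X}(theta, pi) for one future N(theta, 1) value after n observations.

    The prior is N(0, 1/a); a = 0 gives the flat, translation-invariant prior.
    """
    precision = n + a                  # posterior precision of theta
    pred_var = 1.0 + 1.0 / precision   # variance of the predictive density
    shrink = n / precision             # posterior mean equals shrink * xbar
    # Mean squared error of the posterior mean under sampling from N(theta, 1).
    mse = shrink ** 2 / n + (1.0 - shrink) ** 2 * theta ** 2
    # Expected KL divergence from N(theta, 1) to N(posterior mean, pred_var).
    return 0.5 * (math.log(pred_var) + (1.0 + mse) / pred_var - 1.0)

thetas = (-3.0, 0.0, 2.5)
flat = [regret(t, 50, 0.0) for t in thetas]
informative = [regret(t, 50, 1.0) for t in thetas]
print(flat)          # the same value for every theta
print(informative)   # varies with theta
```

In this sketch the flat prior gives regret $\tfrac12\log(1 + 1/n)$ whatever the value of $\theta$, whereas the proper normal prior, which is not translation invariant, gives a regret that grows with $|\theta|$.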

Returning to the definition of the predictive loss function (2.4) relative
to an arbitrary base measure $\pi^B$, we see that this is related to the expected
predictive loss (2.6) by the equation

$$L_{Y|X}(\tau,\pi;\pi^B) = L_{Y|X}(\tau,\pi) - L_{Y|X}(\tau,\pi^B).$$

Therefore, using $\pi^B$ will give rise to an equivalent predictive loss function
if and only if $L_{Y|X}(\theta,\pi^B)$ is constant in $\theta$. In this case we say that $\pi^B$ is
neutral relative to $\pi^J$.

3. Relationship to prior predictive regret. In this section we relate the
posterior predictive regret (2.2) and loss function (2.5) to the prior predictive
regret and loss function. We will use these relationships in Section 4 to obtain
the asymptotic posterior predictive regret $d(\tau,\pi)$ and loss $L(\tau,\pi)$.

For $\pi \in \Pi$, we define the prior predictive regret by

$$d_X(\theta,\pi) = D(p(X|\theta)\,\|\,p^\pi(X)) = \int \log\frac{p(x|\theta)}{p^\pi(x)}\,p(x|\theta)\,d\mu(x), \tag{3.1}$$

which is the relative entropy $D(p(X|\theta)\,\|\,p^\pi(X))$ between $p(x|\theta)$ and the prior
predictive density $p^\pi(x)$. Note that $\pi$ may be improper in this definition.
In that case, unlike the posterior predictive regret, alternative normalizing
constants will give rise to alternative versions of (3.1), differing by constants.
The prior predictive regret (3.1) is the focus of work by Bernardo [5], Clarke
and Barron [7] and others. Now define $\Pi_X \subset \Pi$ to be the class of priors $\pi$
in $\Pi$ for which $d_X(\theta,\pi) < \infty$ for all $\theta \in \Theta$. If $\pi^J \in \Pi_X$, then for $\pi \in \Pi$ we
define the prior predictive loss by

$$L_X(\theta,\pi) = d_X(\theta,\pi) - d_X(\theta,\pi^J) = \int \log\frac{p^J(x)}{p^\pi(x)}\,p(x|\theta)\,d\mu(x), \tag{3.2}$$

which is well defined (possibly $+\infty$).

The posterior predictive regret (2.2) and loss (2.5) are simply related 
to the prior predictive regret (3.1) and loss (3.2). The following result is 
essentially the chain rule for relative entropy. However, we formally state 
and prove it since, first, the distribution of X may be improper here and, 
second, we need to make sure that these relationships are well defined. 

Lemma 3.1. Suppose that $\pi \in \Pi_{X,Y}$. Then $\pi \in \Pi_X$, $d_{Y|X}(\theta,\pi) < \infty$ for
all $\theta \in \Theta$ and

$$d_{Y|X}(\theta,\pi) = d_{X,Y}(\theta,\pi) - d_X(\theta,\pi). \tag{3.3}$$

If further $\pi^J \in \Pi_{X,Y}$, then $L_{Y|X}(\theta,\pi) < \infty$ for all $\theta \in \Theta$ and

$$L_{Y|X}(\theta,\pi) = L_{X,Y}(\theta,\pi) - L_X(\theta,\pi). \tag{3.4}$$

Proof. Since $\pi \in \Pi$, the marginal densities $p^\pi(X)$ and $p^\pi(X,Y)$ are a.s.
$(\theta)$ finite for all $\theta \in \Theta$. Therefore,

$$p^\pi(x,y) = \int p(x,y|\phi)\,d\pi(\phi) = p^\pi(x)\int p(y|x,\phi)\,dp^\pi(\phi|x) = p^\pi(x)\,p^\pi(y|x),$$

since, by definition, $p(x|\phi)\,d\pi(\phi) = p^\pi(x)\,dp^\pi(\phi|x)$. It now follows
straightforwardly from the definitions (2.2) and (3.1) that

$$d_{X,Y}(\theta,\pi) = d_{Y|X}(\theta,\pi) + d_X(\theta,\pi). \tag{3.5}$$

Since $\pi \in \Pi_{X,Y}$, it follows from (3.5) that both $d_{Y|X}(\theta,\pi) < \infty$ and
$\pi \in \Pi_X$ and, hence, relation (3.3) holds. Since $\pi \in \Pi_X$ and $\pi^J \in \Pi_X$, it follows
from (3.2) that $L_{Y|X}(\theta,\pi)$ is finite for all $\theta$. Finally, since $\pi^J \in \Pi$, we have
$p^J(x,y) = p^J(x)\,p^J(y|x)$ and relation (3.4) follows straightforwardly from the
definitions (2.5) and (3.2).

Finally, let $\Omega_X \subset \Omega$ be the class of priors $\tau$ in $\Omega$ satisfying
$\int d_X(\theta,\pi^J)\,d\tau(\theta) < \infty$. It follows from equation (3.3) of Lemma 3.1 that
$\pi^J \in \Pi_{X,Y}$ and $\tau \in \Omega_{X,Y}$ imply that $\int d_{Y|X}(\theta,\pi^J)\,d\tau(\theta) < \infty$, $\tau \in \Omega_X$ and

$$\int d_{Y|X}(\theta,\pi^J)\,d\tau(\theta) = \int d_{X,Y}(\theta,\pi^J)\,d\tau(\theta) - \int d_X(\theta,\pi^J)\,d\tau(\theta).$$

Therefore, if $\pi \in \Pi_{X,Y}$ and $\tau \in \Omega_{X,Y}$, then the expected posterior loss $L_{Y|X}(\tau,\pi)$
at (2.6) is well defined. □
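
The chain rule (3.3) is easy to confirm numerically in a finite setting. The following sketch is an illustration under assumptions of our own: i.i.d. Bernoulli($\theta$) observations, a hypothetical three-point parameter set with a proper discrete prior, $X$ of size $n = 3$ and $Y$ of size $m = 2$. It computes the posterior predictive regret directly from (2.2) and compares it with the difference of two prior predictive regrets (3.1).

```python
import math
from itertools import product

THETAS = [0.1, 0.4, 0.9]  # hypothetical finite parameter set (an assumption)
PRIOR = [0.2, 0.5, 0.3]   # hypothetical proper prior pi on THETAS

def lik(theta, seq):
    """p(seq | theta) for i.i.d. Bernoulli(theta) observations."""
    k = sum(seq)
    return theta ** k * (1 - theta) ** (len(seq) - k)

def marginal(seq):
    """p^pi(seq), the prior predictive probability of the sequence."""
    return sum(w * lik(t, seq) for t, w in zip(THETAS, PRIOR))

def prior_predictive_regret(theta, size):
    """d_X(theta, pi) as in (3.1) for a sample of the given size."""
    return sum(lik(theta, s) * math.log(lik(theta, s) / marginal(s))
               for s in product((0, 1), repeat=size))

def posterior_predictive_regret(theta, n, m):
    """d_{Y|X}(theta, pi) as in (2.2), with X of size n and Y of size m."""
    total = 0.0
    for x in product((0, 1), repeat=n):
        for y in product((0, 1), repeat=m):
            pred = marginal(x + y) / marginal(x)  # p^pi(y | x)
            total += lik(theta, x + y) * math.log(lik(theta, y) / pred)
    return total

theta, n, m = 0.4, 3, 2
direct = posterior_predictive_regret(theta, n, m)
via_chain_rule = (prior_predictive_regret(theta, n + m)
                  - prior_predictive_regret(theta, n))
print(direct, via_chain_rule)
```

The same decomposition is what allows the asymptotic analysis of Section 4 to work entirely with the prior predictive regret.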

4. Asymptotic behavior of the predictive loss. Throughout the remainder
of this article we specialize to the case $X = (X_1,\ldots,X_n)$ and
$Y = (X_{n+1},\ldots,X_{n+m})$, where the $X_i$ are independent observations from a
density $f(x|\theta)$ with respect to a measure $\mu$. In the present section we investigate
the asymptotic behavior as $n \to \infty$ of the predictive loss function (2.5).
In particular, we will show that, under suitable regularity conditions, the
asymptotic form of (2.5) (after suitable normalization) is the same regardless
of the amount $m$ of prediction to be performed. This leads to a general
definition for broad classes of priors $\pi$ and $\tau$ of the (asymptotic) predictive
loss $L(\tau,\pi)$, information $\zeta(\tau)$ and regret $d(\tau,\pi)$.

For an asymptotic analysis of the posterior predictive regret (2.2) and loss
function (2.5), from (3.2), (3.3) and (3.4), we see that it suffices to study
the asymptotic behavior of the prior predictive regret $d_X(\theta,\pi)$. Suppose that
$\pi \in \Pi$ has a density with respect to Lebesgue measure. For notational convenience,
in what follows we will use the same symbol $\pi$ to denote this density.
Let $l(\theta) = n^{-1}\log p(X|\theta) = n^{-1}\sum_{i=1}^n \log f(X_i|\theta)$ be the normalized
loglikelihood function and let $i(\theta) = E^\theta\{-l''(\theta)\} = n^{-1}I(\theta)$ be Fisher's information
per observation. A standard result for the prior predictive regret (3.1) when
$\pi$ is a density (see, e.g., [7]) is that, under suitable regularity conditions,

$$d_X(\theta,\pi) = \frac{p}{2}\log\Bigl(\frac{n}{2\pi e}\Bigr) + \log\frac{|i(\theta)|^{1/2}}{\pi(\theta)} + o(1) \tag{4.1}$$

as $n \to \infty$. [Here the $\pi$ appearing in the first term on the right-hand side of
(4.1) is the usual transcendental number and should not be confused with
the prior $\pi(\cdot)$.] Taking Jeffreys' prior to be $\pi^J(\theta) = |i(\theta)|^{1/2}$, it follows from
(3.2) and (4.1) that the prior predictive loss satisfies

$$L_X(\theta,\pi) = \log\frac{\pi^J(\theta)}{\pi(\theta)} + o(1).$$

It now follows from (3.4) that, for any sequence $m = m_n \ge 1$, $L_{Y|X}(\theta,\pi) = o(1)$;
that is, to first order the posterior predictive loss is identically zero
for every smooth prior $\pi$. It is therefore necessary to develop further the
asymptotic expansion in (4.1). Let $\hat\theta$ denote the maximum likelihood estimator
based on the data $X$ and assume that the observed information matrix
$J = -n\,l''(\hat\theta)$ is positive definite over the set $S$ for which $P^\theta(S) = 1 + o(n^{-1})$,
uniformly in compact subsets of $\Theta$.
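
The first-order expansion of the prior predictive regret, with leading terms $\tfrac{p}{2}\log\{n/(2\pi e)\} + \log\{|i(\theta)|^{1/2}/\pi(\theta)\}$, can be checked against an exact calculation in a simple case. The sketch below assumes a Bernoulli model with a Beta$(a,b)$ prior, defaulting to the normalized Jeffreys prior Beta$(1/2,1/2)$; the model and the function names are illustrative assumptions, not part of the paper. The exact regret is a finite sum over the number of successes because the ratio $p(x|\theta)/p^\pi(x)$ depends on $x$ only through that count.

```python
import math

def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def exact_regret(theta, n, a=0.5, b=0.5):
    """Exact d_X(theta, pi) for n Bernoulli(theta) trials and a Beta(a, b) prior."""
    total = 0.0
    for k in range(n + 1):
        log_lik = k * math.log(theta) + (n - k) * math.log(1 - theta)
        # Prior predictive probability of any one sequence with k successes.
        log_marg = log_beta(k + a, n - k + b) - log_beta(a, b)
        log_binom = (math.lgamma(n + 1) - math.lgamma(k + 1)
                     - math.lgamma(n - k + 1))
        total += math.exp(log_binom + log_lik) * (log_lik - log_marg)
    return total

def first_order_regret(theta, n, a=0.5, b=0.5):
    """Leading terms of the expansion for p = 1, omitting the o(1) remainder."""
    log_prior = ((a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)
                 - log_beta(a, b))
    log_root_info = -0.5 * math.log(theta * (1 - theta))  # log |i(theta)|^{1/2}
    return (0.5 * math.log(n / (2 * math.pi * math.e))
            + log_root_info - log_prior)

for n in (20, 200, 2000):
    print(n, exact_regret(0.3, n), first_order_regret(0.3, n))
```

For an interior value such as $\theta = 0.3$ the gap between the two columns should shrink as $n$ grows, in line with the $o(1)$ remainder; near the boundary of the parameter space the approximation is expected to be slower.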

Let $\Pi_\infty$ be the class of priors $\pi \in \Pi$ for which $\pi \in \Pi_X$ for all $n$ and let
$\mathcal{C} \subset \Pi_\infty$ be the class of priors in $\Pi_\infty$ that possess densities having continuous
second-order derivatives throughout $\Theta$. Then, under suitable additional
regularity conditions on $f$ and $\pi \in \mathcal{C}$ to be discussed in Section 5, the marginal
density of $X$ is



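$$p^\pi(x) = (2\pi s_B^{-2})^{p/2}\,|J|^{-1/2}\,p(x|\hat\theta)\,\pi(\hat\theta)\,\{1 + o(n^{-1})\},$$

where $s_B^2 = (1 + b_B)^{-1}$ is a Bayesian Bartlett correction, with $b_B = O(n^{-1})$;
see, for example, [22]. Therefore, we can write

$$\log\frac{p(x|\theta)}{p^\pi(x)} = \frac{p}{2}\log\Bigl(\frac{n}{2\pi e}\Bigr) + \log\frac{|i(\theta)|^{1/2}}{\pi(\theta)}
 - \log\frac{\pi(\hat\theta)}{\pi(\theta)} + \frac12\log\frac{|J|}{|I(\theta)|} + \frac{p}{2}\log s_B^2
 + \frac{p}{2} - n\{l(\hat\theta) - l(\theta)\} + o(n^{-1}).$$

Since $E^\theta[n\{l(\hat\theta) - l(\theta)\}] = p\,s_F^{-2}(\theta)/2$, where $s_F^2(\theta) = \{1 + b_F(\theta)\}^{-1}$ is
a frequentist Bartlett correction, with $b_F(\theta) = O(n^{-1})$, it follows from (3.1)
that

$$d_X(\theta,\pi) = \frac{p}{2}\log\Bigl(\frac{n}{2\pi e}\Bigr) + \log\frac{|i(\theta)|^{1/2}}{\pi(\theta)} + h_n(\theta,\pi), \tag{4.2}$$

where

$$h_n(\theta,\pi) = -\frac{p}{2}\{E^\theta(b_B) + b_F(\theta)\} - E^\theta\Bigl[\log\frac{\pi(\hat\theta)}{\pi(\theta)}\Bigr]
 + \frac12 E^\theta\Bigl[\log\frac{|J|}{|I(\theta)|}\Bigr] + o(n^{-1}). \tag{4.3}$$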



Under suitable regularity conditions, the leading term in (4.3) turns out
to be $O(n^{-1})$, since both the Bayesian and frequentist Bartlett corrections
are $O(n^{-1})$, as are all the expectations on the right-hand side of (4.3). We
will therefore suppose that $h_n$ is of the form

$$h_n(\theta,\pi) = -\frac{D(\theta,\pi)}{2n} + r_n(\theta,\pi), \tag{4.4}$$

where $D(\theta,\pi)$ is continuous in $\theta$ and the remainder term $r_n(\theta,\pi)$ satisfies
one of the following three successively stronger conditions:

R1. $r_n(\theta,\pi) = o(n^{-1})$ uniformly in compacts of $\Theta$;
R2. $r_n(\theta,\pi) = O(n^{-2})$ uniformly in compacts of $\Theta$;
R3. $r_n(\theta,\pi) = E(\theta,\pi)n^{-2} + o(n^{-2})$ uniformly in compacts of $\Theta$, where $E(\theta,\pi)$
is continuous in $\theta$.

The above three forms of remainder require successively stronger assumptions
about both the likelihood $p(\cdot|\theta)$ and the prior $\pi(\theta)$. Suitable sets of
regularity conditions for the validity of (4.4) will be discussed in Section 5.
In particular, $\pi \in \mathcal{C}$ is a sufficient condition on the prior for the weakest form
R1 of remainder. The form of $D(\theta,\pi)$ for $\pi \in \mathcal{C}$ will be derived in Section 5.

Throughout the remainder of the paper we assume that $\pi^J \in \mathcal{C}$ and define,
for all $\pi \in \mathcal{C}$,

$$L(\theta,\pi) = D(\theta,\pi) - D(\theta,\pi^J). \tag{4.5}$$

We note that $L(\theta,\pi)$ is well defined when $\pi$ is improper since the arbitrary
normalizing constant in $\pi$ does not appear in $D(\theta,\pi)$. We will study the
asymptotic behavior of the posterior predictive loss (2.5) as $n \to \infty$ for an
arbitrary number $m_n \ge 1$ of predictions $Y_i$. Let $c_n = 2n(n + m_n)/m_n$. The
next theorem gives conditions under which

$$c_n L_{Y|X}(\theta,\pi) \to L(\theta,\pi) \tag{4.6}$$

uniformly in compacts of $\Theta$ under each of the forms R1-R3 of remainder.

Theorem 4.1.

(a) Suppose that R1 holds. Then (4.6) holds whenever $\liminf_{n\to\infty} m_n/n > 0$.

(b) Suppose that R2 holds. Then (4.6) holds whenever $m_n \to \infty$.

(c) Suppose that R3 holds. Then (4.6) holds for every sequence $(m_n)$ of
positive integers.

Proof. First note that (3.2), (4.2), (4.4) and (4.5) give, on taking
$\pi^J(\theta) = |i(\theta)|^{1/2}$,

$$L_X(\theta,\pi) = \log\frac{\pi^J(\theta)}{\pi(\theta)} - \frac{L(\theta,\pi)}{2n} + \bar r_n(\theta,\pi), \tag{4.7}$$

where $\bar r_n(\theta,\pi) = r_n(\theta,\pi) - r_n(\theta,\pi^J)$. Also note that, since $\pi \in \Pi_\infty$, Lemma
3.1 applies for all $n$.

(a) From (3.4), (4.7) and R1, we have $L_{Y|X}(\theta,\pi) = c_n^{-1}L(\theta,\pi) + o(n^{-1})$
and (4.6) follows since $n^{-1}c_n = 2(m_n^{-1}n + 1)$ and $\limsup_{n\to\infty} m_n^{-1}n < \infty$.

(b) From (3.4), (4.7) and R2, we have $L_{Y|X}(\theta,\pi) = c_n^{-1}L(\theta,\pi) + O(n^{-2})$
and (4.6) follows since $n^{-2}c_n = 2(m_n^{-1} + n^{-1}) \to 0$.

(c) From (3.4), (4.7) and R3, we have
$L_{Y|X}(\theta,\pi) = c_n^{-1}\{L(\theta,\pi) - d_n^{-1}\bar E(\theta,\pi)\} + o(n^{-2})$,
where $d_n = \{2(2n + m_n)\}^{-1}n(n + m_n)$ and
$\bar E(\theta,\pi) = E(\theta,\pi) - E(\theta,\pi^J)$. (4.6) follows since $d_n^{-1} = O(n^{-1})$ and
$n^{-2}c_n = 2(m_n^{-1} + n^{-1})$ is bounded. □
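
The role of the normalizing sequence $c_n$ can be seen from a short calculation. The following is a sketch only: it assumes that (4.7) has the form $L_X(\theta,\pi) = \log\{\pi^J(\theta)/\pi(\theta)\} - L(\theta,\pi)/(2n) + \bar r_n(\theta,\pi)$, applies it once with sample size $n + m_n$ and once with sample size $n$, and combines the two through (3.4).

```latex
\begin{align*}
L_{Y|X}(\theta,\pi)
  &= L_{X,Y}(\theta,\pi) - L_X(\theta,\pi) \\
  &= \Bigl\{\frac{1}{2n} - \frac{1}{2(n+m_n)}\Bigr\} L(\theta,\pi)
     + \bar r_{n+m_n}(\theta,\pi) - \bar r_n(\theta,\pi) \\
  &= \frac{m_n}{2n(n+m_n)}\, L(\theta,\pi)
     + \bar r_{n+m_n}(\theta,\pi) - \bar r_n(\theta,\pi) \\
  &= c_n^{-1} L(\theta,\pi)
     + \bar r_{n+m_n}(\theta,\pi) - \bar r_n(\theta,\pi).
\end{align*}
```

The term $\log\{\pi^J(\theta)/\pi(\theta)\}$ does not depend on the sample size and cancels, so (4.6) reduces to showing that $c_n\{\bar r_{n+m_n}(\theta,\pi) - \bar r_n(\theta,\pi)\} \to 0$, which is what each of the cases (a)-(c) verifies under its own condition on $m_n$.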

Theorem 4.1 tells us that, although the predictive loss function (2.5) cov- 
ers an infinite variety of possibilities for the amount of data to be observed 
and predictions to be made, it is approximately equivalent to the single 
loss function (4.5), provided that a sufficient amount of data X is to be 
observed. Although this is not surprising given the form of (4.7) and the 
relation (3.4), it considerably simplifies the task of assessing the predictive 
risk arising from using alternative priors. We will refer to $L(\theta,\pi)$ as the
(asymptotic) predictive loss function. A special case of interest arises when
$m_n = n$, which corresponds to prediction of a replicate data set of the same
size as that to be observed. Note that in this case (4.6) holds under the
weakest condition R1. More generally, Laud and Ibrahim [18] refer to the
posterior predictive density of Y in this case as the "predictive density of a 
replicate experiment," which they study in relation to model choice. 
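
The convergence (4.6) and its insensitivity to the number of predictions can be illustrated in a case where everything is available in closed form. The sketch below assumes a normal location model with unit variance and a hypothetical $N(0,1/a)$ prior, neither of which is taken from the paper. In that model Jeffreys' prior is flat and $i(\theta) = 1$, and a direct expansion of the exact loss gives the limit $(\log\pi)'(\theta)^2 + 2(\log\pi)''(\theta) = a^2\theta^2 - 2a$. The loss for $m$ future observations is accumulated from one-step losses using the chain rule for relative entropy.

```python
import math

def one_step_regret(theta, k, a):
    """Exact regret (2.2) for predicting one N(theta, 1) value from k observations.

    The prior is N(0, 1/a); a = 0 corresponds to the flat (Jeffreys) prior.
    """
    precision = k + a
    pred_var = 1.0 + 1.0 / precision
    shrink = k / precision
    mse = shrink ** 2 / k + (1.0 - shrink) ** 2 * theta ** 2
    return 0.5 * (math.log1p(1.0 / precision) + (1.0 + mse) / pred_var - 1.0)

def normalized_loss(theta, n, m, a):
    """c_n * L_{Y|X}(theta, pi) with c_n = 2n(n + m)/m."""
    loss = sum(one_step_regret(theta, k, a) - one_step_regret(theta, k, 0.0)
               for k in range(n, n + m))
    return 2.0 * n * (n + m) / m * loss

def asymptotic_loss(theta, a):
    """Limit a^2 theta^2 - 2a for the N(0, 1/a) prior in this model."""
    return (a * theta) ** 2 - 2.0 * a

theta, a, n = 1.5, 0.5, 2000
for m in (1, 50, n):
    print(m, normalized_loss(theta, n, m, a), asymptotic_loss(theta, a))
```

In this sketch the normalized loss is close to the same limit whether one, fifty or $n$ future observations are predicted.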

Now let $\Omega_\infty$ be the class of priors $\tau \in \Omega$ for which $\tau \in \Omega_X$ for all $n$.
Although the expected predictive loss $L_{Y|X}(\tau,\pi)$ is well defined (possibly $+\infty$)
when $\pi \in \Pi_\infty$ and $\tau \in \Omega_\infty$, in general, the expected asymptotic predictive
loss $\int L(\theta,\pi)\,d\tau(\theta)$ may not exist, and when it does, additional conditions
will be needed for it to be the limit of the expected loss $c_n L_{Y|X}(\tau,\pi)$. In
order to retain generality, we will extend the domain of definition of the
asymptotic predictive loss (4.5) so that it is defined for all $\pi \in \Pi_\infty$ and
$\tau \in \Omega_\infty$. Thus, for $\pi \in \Pi_\infty$, $\tau \in \Omega_\infty$ and a given sequence $(m_n)$ of positive
integers, we define the (asymptotic) predictive loss to be

$$L(\tau,\pi) = \limsup_{n\to\infty} c_n L_{Y|X}(\tau,\pi), \tag{4.8}$$

which always exists (possibly $+\infty$). Thus, $L(\tau,\pi)$ represents the asymptotically
worst-case predictive loss when the prior $\pi$ is used in relation to the
alternative proper prior $\tau$. Since the degenerate prior $\tau = \{\theta\}$ is in $\Omega_\infty$, (4.8)
also provides a definition of $L(\theta,\pi)$ for all $\pi \in \Pi_\infty$, $\theta \in \Theta$, which agrees with
(4.5) whenever $\pi \in \mathcal{C} \subset \Pi_\infty$ and one of the conditions R1-R3 holds.

Now define the (asymptotic) predictive information contained in
$\tau \in \Omega_\infty \cap \Pi_\infty$ to be

$$\zeta(\tau) = -L(\tau,\tau) = \liminf_{n\to\infty} c_n \zeta_{Y|X}(\tau) \tag{4.9}$$

and let $\Phi \subset \Omega_\infty \cap \Pi_\infty$ be the class of $\tau$ for which $\zeta(\tau) < \infty$. Finally, for
$\pi \in \Pi_\infty$ and $\tau \in \Phi$ define

$$d(\tau,\pi) = L(\tau,\pi) + \zeta(\tau), \tag{4.10}$$

which is the asymptotic form of equation (2.8). The next lemma implies that
the predictive loss function (4.8) is a $\Phi$-proper scoring rule and that $d(\tau,\pi)$
is the regret associated with $L(\tau,\pi)$.



Lemma 4.1. For all $\tau \in \Phi$,

$$\inf_{\pi\in\Pi_\infty} L(\tau,\pi) = L(\tau,\tau) = -\zeta(\tau).$$

Proof. Let $\tau \in \Phi$. By construction, $d(\tau,\tau) = 0$, so we only need to show
that $d(\tau,\pi) \ge 0$ for all $\pi \in \Pi_\infty$. Since $\pi \in \Pi_\infty$ and $\tau \in \Omega_\infty \cap \Pi_\infty$, we have
$\pi \in \Pi_{X,Y}$ and $\tau \in \Omega_{X,Y} \cap \Pi_{X,Y}$ for all $n$ and, hence, the quantities $L_{Y|X}(\tau,\pi)$
and $L_{Y|X}(\tau,\tau)$ are both well defined. But $L_{Y|X}(\tau,\tau) \le L_{Y|X}(\tau,\pi)$ and
multiplying both sides of this inequality by $c_n$ and taking the $\limsup_{n\to\infty}$ on
both sides of the resulting inequality gives $L(\tau,\tau) \le L(\tau,\pi)$. The result
follows from the definition of $d(\tau,\pi)$. □

When $\pi \in \mathcal{C}$, $L(\theta,\pi)$ is independent of the sequence $m_n$. In general, however,
both $L(\tau,\pi)$ and $\zeta(\tau)$ may depend on the particular sequence $(m_n)$,
although we have suppressed this dependence in the notation. Nevertheless,
the minimax results of Section 6 will be independent of $(m_n)$.

5. Derivation of the asymptotic predictive loss function. In this section
we obtain the form of the function $D(\theta,\pi)$ arising in the $O(n^{-1})$ term in the
asymptotic expansion of the prior predictive regret $d_X(\theta,\pi)$. This then leads
to an expression for the asymptotic predictive loss function $L(\theta,\pi)$ for all
$\pi \in \mathcal{C}$ via relation (4.5). The computations involved in the determination of
$D(\theta,\pi)$, which are similar in nature to computations in [14], are technically
quite demanding. Finally, we deduce expressions for the asymptotic posterior
predictive regret (4.10) and predictive information (4.9) under certain
conditions.

Theorem 5.1 below is the central result of this section. Write $D_j = \partial/\partial\theta^j$,
$j = 1,\ldots,p$. Let $\rho = \rho(\theta) = \log\pi(\theta)$ and write $\rho_r = D_r\rho$. We use the
summation convention throughout.

Theorem 5.1. Assume that one of the conditions R1-R3 holds. Then

$$D(\theta,\pi) = A(\theta,\pi) + M(\theta), \tag{5.1}$$

where

$$A(\theta,\pi) = i^{rs}\rho_r\rho_s + 2D_s(i^{rs}\rho_r) \tag{5.2}$$

and $M(\theta)$ is independent of $\pi$.
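
For a single real parameter the prior-dependent part of $D(\theta,\pi)$ is simple to evaluate. The sketch below assumes $p = 1$ and reads $i^{rs}$ as the reciprocal of the Fisher information per observation, so that $A(\theta,\pi) = \rho'(\theta)^2/i(\theta) + 2\,\frac{d}{d\theta}\{\rho'(\theta)/i(\theta)\}$ with $\rho = \log\pi$; the helper names and the use of a central difference for the outer derivative are illustrative choices. Since $M(\theta)$ does not depend on the prior, the predictive loss relative to Jeffreys' prior is a difference of two such terms.

```python
def A_scalar(rho_prime, fisher_info, theta, h=1e-5):
    """A(theta, pi) for scalar theta, given rho' = (log pi)' and i(theta)."""
    g = lambda t: rho_prime(t) / fisher_info(t)          # i^{-1} rho'
    dg = (g(theta + h) - g(theta - h)) / (2.0 * h)       # central difference
    return rho_prime(theta) ** 2 / fisher_info(theta) + 2.0 * dg

def predictive_loss(rho_prime, jeffreys_rho_prime, fisher_info, theta):
    """L(theta, pi) = A(theta, pi) - A(theta, pi^J); M(theta) cancels."""
    return (A_scalar(rho_prime, fisher_info, theta)
            - A_scalar(jeffreys_rho_prime, fisher_info, theta))

# Normal mean with unit variance: i(theta) = 1 and Jeffreys' prior is flat.
# Hypothetical prior N(0, 2), so rho'(theta) = -theta / 2.
normal_loss = predictive_loss(lambda t: -t / 2.0, lambda t: 0.0,
                              lambda t: 1.0, 1.5)

# Bernoulli model: i(theta) = 1 / (theta (1 - theta)); Jeffreys' prior has
# rho'(theta) = (2 theta - 1) / (2 theta (1 - theta)).
bern_info = lambda t: 1.0 / (t * (1.0 - t))
jeffreys_rho_prime = lambda t: (2.0 * t - 1.0) / (2.0 * t * (1.0 - t))
A_jeffreys = A_scalar(jeffreys_rho_prime, bern_info, 0.3)

print(normal_loss, A_jeffreys)
```

In the normal case the result agrees with the closed form $\theta^2/v^2 - 2/v$ for a $N(0,v)$ prior, and in the Bernoulli case with $(2\theta - 1)^2/\{4\theta(1-\theta)\} + 2$; both closed forms follow from the scalar version of (5.2) as read above.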

We will prove Theorem 5.1 via four lemmas, each of which evaluates the 
leading term in one of the terms on the right-hand side of equation (4.3). 
We discuss suitable sets of regularity conditions following the proof. 

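For $1 \le j, k, r, \ldots \le p$, define $D_{jkr\cdots} = D_jD_kD_r\cdots$, $a_{jkr\cdots} = \{D_{jkr\cdots}\,l(\theta)\}_{\theta=\hat\theta}$,
$c_{jr} = -a_{jr}$, $C = (c_{jr})$, $C^{-1} = (c^{jr})$, $\rho_{jk\cdots} = D_{jk\cdots}\rho$, $\hat\rho_{jk\cdots} = \rho_{jk\cdots}(\hat\theta)$

and

$k_{jk\cdots,rs\cdots} = k_{jk\cdots,rs\cdots}(\theta) = E^\theta\{D_{jk\cdots}\log f(X_1;\theta)\,D_{rs\cdots}\log f(X_1;\theta)\}$.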
Also define 

^1 = {Pjr ~^ PjPr)i ^2 ~ '^^jrsui'^ i j 

— r^hjj^j^- pgZ X , — 15Ajj^^ /c^i^^Z"^ % 1 

and 

Ql=A.s^'^ Q2 = kl Q3 = SDs{kijreHn, Qa = K- 

Lemma 5.1.

nE'^ibB) ^—(k; + —k% + -k\ + —kl] . 
^ ' 2p V 12 ^ 3 ^ 36 ^/ 

Proof. Comparing with the Bayesian Bartlett correction factor as given 
in equation (2.6) of [13], we obtain 

where 

Hi = C^^'ipjr + PjPr), H2 = 30j>s„C^''c'''", 

Noting that $E^\theta(H_a) = k_a^* + o(1)$, $a = 1, \ldots, 4$, the lemma follows from (5.3). □

Lemma 5.2. 

n6H^)^^(Qi + 3^Q.4Q3 + ^Q4). 

Proof. Comparing with the frequentist Bartlett correction factor as 
given in equation (2.10) of [13], we obtain 

from which the result follows. □ 
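
Lemma 5.3.

$nE^\theta\left[\log\{\pi(\hat\theta)/\pi(\theta)\}\right] \to \rho_r b^r + \tfrac{1}{2}i^{jr}\rho_{jr}$,

where $b^r = i^{rj}i^{kt}k_{jk,t} + \tfrac{1}{2}i^{rj}i^{kt}k_{jkt}$.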

Proof. From [20], page 209, we see that

(5.4)  $E^\theta(\hat\theta^r) = \theta^r + n^{-1}b^r + o(n^{-1})$,

(5.5)  $\mathrm{Cov}^\theta(\hat\theta^r, \hat\theta^s) = n^{-1}i^{rs} + o(n^{-1})$.

By applying Bartlett's identity,

$k_{jkt} + k_{j,kt} + k_{k,jt} + k_{t,jk} + k_{j,k,t} = 0$

(cf. equation (7.2) of [20]), it can be seen that our expression for $b^r$ agrees
with that of McCullagh. From (5.4), (5.5) and the Taylor expansion of $\rho(\hat\theta)$
around $\theta$, we obtain

$E^\theta\{\rho(\hat\theta)\} = \rho(\theta) + n^{-1}b^r\rho_r + \tfrac{1}{2}n^{-1}\rho_{rs}i^{rs} + o(n^{-1})$,

from which the lemma follows. □

Lemma 5.4. 



nE 



log 



11^^)1 

Proof. By the Taylor expansion of $a_{jr} = l_{jr}(\hat\theta)$ around $\theta$, we get

(5.6)  $a_{jr} = k_{jr}(\theta) + e_{jr} + o(n^{-1})$,

where

(5.7)  $e_{jr} = l_{jr} - k_{jr} + k_{jrs}(\hat\theta^s - \theta^s) + (l_{jrs} - k_{jrs})(\hat\theta^s - \theta^s) + \tfrac{1}{2}k_{jrst}(\hat\theta^s - \theta^s)(\hat\theta^t - \theta^t)$.

From (5.6) and (5.7), we obtain

$C = i(\theta) - E_n + o(n^{-1})$,

where $E_n = (e_{jr})$. Noting that $J = nC$, $I(\theta) = n\,i(\theta)$, $i(\theta)$ is positive definite
and $E_n$ is a matrix with elements of order $O(n^{-1/2})$, from the above expression
for $C$ and standard results on the eigenvalues and determinant of
a matrix, it follows by the Taylor expansion that

(5.8) log{ = - \^^{^~\G)E,i-\G)E,] + o{n-^'^). 
Using an expansion for $\hat\theta^r - \theta^r$ as in [20], Chapter 7, we obtain

(5.9) 9' -9' = + i'^Huiljk - k,k) + hkjkt^''i'"XL} + o{n"^/^). 
Substituting (5.9) into (5.7) and using (5.4) and (5.5), it follows that

(5.10)  $E^\theta(e_{jr}) = n^{-1}(k_{jrs}b^s + k_{jrs,t}i^{st} + \tfrac{1}{2}k_{jrst}i^{st}) + o(n^{-1})$

and

(5.11)  $E^\theta(e_{jr}e_{ku}) = n^{-1}\{(k_{jr,ku} - k_{jr}k_{ku}) + (k_{jrt}k_{ku,w} + k_{kuw}k_{jr,t} + k_{jrt}k_{kuw})i^{tw}\} + o(n^{-1})$.

While all four terms on the right-hand side of (5.7) are required in evaluating
(5.10), only the first two terms on the right-hand side of (5.7) are
required in evaluating (5.11). The lemma follows on taking expectations on
both sides of (5.8) and using (5.10) and (5.11) on the right-hand side. □

Proof of Theorem 5.1. First, putting Lemmas 5.1 and 5.2 together 
gives 

np{E'{bB) + bpie)} ^ l{(Qi + kl) - i(g3 - k*) + |(Q2 + ^Q4)}. 
Along with Lemmas 5.3 and 5.4, this gives equation (5.1) with 

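$A(\theta, \pi) = i^{rs}(\rho_r\rho_s + 2\rho_{rs}) + 2(k_{jku} + k_{jk,u})i^{ju}i^{kv}\rho_v$.

Now note that $D_r i_{kj} = -D_r E^\theta(l_{kj}) = -(k_{kjr} + k_{kj,r})$ so that

$A(\theta, \pi) = i^{rs}(\rho_r\rho_s + 2\rho_{rs}) - 2D_u(i_{jk})i^{ju}i^{kv}\rho_v$.

Finally, $D_u(i_{jk})i^{ju}i^{kv} = -D_u(i^{ju})i_{jk}i^{kv} = -D_u(i^{uv})$ and so

$A(\theta, \pi) = i^{rs}(\rho_r\rho_s + 2\rho_{rs}) + 2D_s(i^{rs})\rho_r = i^{rs}\rho_r\rho_s + 2D_s(i^{rs}\rho_r)$,

as required. □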

We briefly discuss suitable regularity conditions on the likelihood and
prior for the validity of the three forms of remainder R1-R3, although we
will not dwell on alternative sets of sufficient conditions in the present paper.
There are broadly two sets of conditions required, those for the validity of
the Laplace approximation of $p^\pi(x)$ and those for the validity of the approximation
of each of the terms in (4.3). Consider first the form of remainder
R2, ignoring for the moment the uniformity requirement. A suitable set of
conditions for this form of remainder is given in Section 3 of [15], which constitutes
the definition of a "Laplace-regular" family. Broadly, one requires
$l(\theta)$ to be six-times continuously differentiable and $\pi(\theta)$ to be four-times
continuously differentiable, plus additional conditions controlling the error
term and nonlocal behavior of the integrand. Since additionally we require
uniformity in compact subsets of $\Theta$ in R2, we need to replace the neighborhood
$B_\varepsilon(\theta_0)$ in these conditions by an arbitrary compact subset of $\Theta$.
In addition to these conditions, for the approximation of the terms in (4.3)
we require the expectations of the mixed fourth-order partial derivatives of
$\log f(X; \theta)$ to be continuous and also conditions guaranteeing the expansions
for the expectation of $\hat\theta$ needed in the proofs of Lemmas 5.3 and 5.4, as given
in [20], Chapter 7. From an examination of the relevant proofs, it is seen
that a slight strengthening of the above conditions will be required for the
stronger form R3 of remainder. For example, $l(\theta)$ and $\pi(\theta)$ seven-times and
five-times continuously differentiable, respectively, will give rise to a higher-order
version of Laplace-regularity. Finally, the weaker form of remainder R1
would apply when $l(\theta)$ and $\pi(\theta)$ are only four-times and twice continuously
differentiable, respectively, again with additional regularity conditions controlling,
for example, the nonlocal behavior of the integrand in the Laplace
approximation and giving uniformity of all the $o(n^{-1})$ remainder terms.

Returning to the predictive loss function, it follows from Theorem 5.1
that, for $\pi \in \mathcal{C}$, the asymptotic predictive loss function (4.5) is given by

(5.12)  $L(\theta, \pi) = A(\theta, \pi) - A(\theta, \pi^J)$,

where $A(\theta, \pi^J) = i^{rs}\nu_r\nu_s + 2D_s(i^{rs}\nu_r)$ and $\nu = \log \pi^J = \tfrac{1}{2}\log|i|$. It is interesting
to note that (5.12) is of the same form as the right-hand side of the
first expression in Theorem 4 of [14], which relates to the comparison of
estimative predictive distributions based on Bayes estimators. In the case
of a single prediction ($m = 1$), the connection can be understood from Theorem 7
of [14], which establishes that, to the asymptotic order considered
here, the Kullback-Leibler difference between the posterior and the associated
estimative predictive distributions is independent of the prior. The
derivation of Theorem 5.1 given here is more direct, as it does not involve
Bayes estimators. Moreover, our result applies for an arbitrary amount of
prediction.

Note that $L(\theta, \pi)$ only depends on the sampling model through Fisher's
information. The quantity $M(\theta)$, however, involves components of skewness
and curvature of the model. We do not consider $M(\theta)$ further in this paper,
although its form, which may be deduced from the results of Lemmas 5.1-5.4,
may be of independent interest. It may be verified directly that $L(\theta, \pi)$ is
invariant under parameter transformation, as expected in view of (4.6) and
the invariance of $L_{Y|X}(\theta, \pi)$. Furthermore, since all the terms in (4.2) are
invariant, it follows that $\tilde{M}(\theta) = M(\theta) + A(\theta, \pi^J)$ must also be an invariant
quantity. In the case $p = 1$, we obtain the relatively simple expression

curvature, with

where $i^{(j)}$ is the $j$th derivative of $i$.

Example 5.1. Normal model with unknown mean. As a simple first
example, suppose that $X_i \sim N(\theta, 1)$. Here $i(\theta) = 1$ and $\alpha_{111}(\theta) = \gamma^2(\theta) = 0$
so that $L(\theta, \pi) = (\rho')^2 + 2\rho''$ and $M(\theta) = 0$ from (5.13). By construction,
$L(\theta, \pi^J) = 0$, but note that the improper priors $\pi^c \propto \exp\{c(\theta - \theta_0)\}$, $c \in \mathcal{R}$,
also deliver constant loss, with $L(\theta, \pi^c) = c^2 \ge 0$. We will see in Section 6
that Jeffreys' prior is minimax in this example. Since here $M(\theta) = 0$ and
$\pi^J(\theta) \propto 1$, this result also follows from the exact analysis of the criterion
(2.1) in [19].
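
As an illustrative check of Example 5.1, and not part of the original derivation, the following sketch evaluates $L(\theta, \pi) = (\rho')^2 + 2\rho''$ by central finite differences for the tilted priors $\pi^c \propto \exp\{c(\theta - \theta_0)\}$. The function names, the step size `h` and the choice $\theta_0 = 0$ are assumptions made for illustration only.

```python
def loss_normal_mean(log_prior, theta, h=1e-3):
    """L(theta, pi) = (rho')^2 + 2 rho'' for the N(theta, 1) model, where i(theta) = 1.

    rho = log pi is differentiated numerically with central differences;
    h is an illustrative step size, not a value taken from the paper.
    """
    d1 = (log_prior(theta + h) - log_prior(theta - h)) / (2 * h)
    d2 = (log_prior(theta + h) - 2 * log_prior(theta) + log_prior(theta - h)) / h**2
    return d1**2 + 2 * d2


def tilted_log_prior(c, theta0=0.0):
    """Log of the improper prior pi^c(theta) proportional to exp{c (theta - theta0)}."""
    return lambda theta: c * (theta - theta0)


# The loss should be constant in theta and equal to c^2;
# c = 0 corresponds to Jeffreys' (flat) prior.
for c in (0.0, 0.5, -2.0):
    values = [loss_normal_mean(tilted_log_prior(c), t) for t in (-3.0, 0.0, 4.0)]
    print(c, [round(v, 6) for v in values])
```

Because $\rho$ is linear in $\theta$ for this family, the second-difference term vanishes up to rounding error, so the computed loss does not vary with $\theta$.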
Now let $\Omega$ be the class of priors having compact support in $\Theta$ and let
$\Gamma = \Omega \cap \mathcal{C}$. It follows from (4.6) that if $\pi \in \mathcal{C}$ and $\tau \in \Omega$, then $L(\tau, \pi)$ is
equal to the expected predictive loss $\int L(\theta, \pi)\tau(\theta)\,d\theta$. Since $\tau \in \mathcal{C}$ when $\tau \in \Gamma$, we then also
have $\zeta(\tau) = -\int L(\theta, \tau)\tau(\theta)\,d\theta$, which is finite since $L(\theta, \tau)$ is continuous and,
hence, bounded on compact subsets of $\Theta$. The next result gives expressions
for the predictive regret $d(\tau, \pi)$ and predictive information $\zeta(\tau)$ when $\pi \in \mathcal{C}$
and $\tau \in \Gamma$. The expression for $\zeta(\tau)$ here is similar to that given in Theorem
5 of [14] for the Bayes risk of bias-adjusted estimators.



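Lemma 5.5. Suppose $\pi \in \mathcal{C}$ and $\tau \in \Gamma$. Then

(5.14)  $d(\tau, \pi) = \int i^{rs}(\rho_r - \mu_r)(\rho_s - \mu_s)\tau\,d\theta$

and

(5.15)  $\zeta(\tau) = \int i^{rs}(\mu_r - \nu_r)(\mu_s - \nu_s)\tau\,d\theta$,

where $\mu = \log \tau$.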



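Proof. From (5.2), integration by parts gives

(5.16)  $\int A(\theta, \pi)\tau(\theta)\,d\theta = \int i^{rs}\rho_r\rho_s\tau\,d\theta - 2\int i^{rs}\rho_r\mu_s\tau\,d\theta + 2\beta(\tau, \pi)$,

where

$\beta(\tau, \pi) = \sum_{s=1}^{p}\int\left[i^{rs}\rho_r\tau\right]_{\theta^s = \underline{\theta}^s(\theta^{(-s)})}^{\theta^s = \overline{\theta}^s(\theta^{(-s)})}\,d\theta^{(-s)}$

and $\underline{\theta}^s(\theta^{(-s)})$ and $\overline{\theta}^s(\theta^{(-s)})$ are the finite lower and upper limits of integration
for $\theta^s$ for fixed $\theta^{(-s)}$, the vector of components of $\theta$ omitting $\theta^s$. But
$\beta(\tau, \pi) = 0$, since both $\pi$ and $\tau$ are in $\mathcal{C}$. Therefore,

(5.17)  $\int A(\theta, \pi)\tau(\theta)\,d\theta = \int i^{rs}\rho_r(\rho_s - 2\mu_s)\tau\,d\theta$.

Evaluating (5.17) at $\pi = \tau \in \mathcal{C}$ gives

(5.18)  $\int A(\theta, \tau)\tau(\theta)\,d\theta = -\int i^{rs}\mu_r\mu_s\tau\,d\theta$.

It now follows from (5.17) and (5.18) that

$d(\tau, \pi) = L(\tau, \pi) - L(\tau, \tau) = \int\{A(\theta, \pi) - A(\theta, \tau)\}\tau(\theta)\,d\theta$

$= \int i^{rs}\{\rho_r(\rho_s - 2\mu_s) + \mu_r\mu_s\}\tau\,d\theta$,

which gives (5.14). Since $\zeta(\tau) = d(\tau, \pi^J)$, (5.15) follows on evaluating the
above expression at $\pi = \pi^J$. □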
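
The identity behind (5.14) can be checked numerically in the simplest setting. The sketch below assumes $p = 1$ and $i(\theta) = 1$, takes $\tau$ to be the density of $2V - 1$ with $V \sim$ beta(4, 4), and takes $\rho(\theta) = c\theta - b\theta^2/2$ with hypothetical constants $b$ and $c$; none of these choices comes from the paper. It compares $\int\{A(\theta, \pi) - A(\theta, \tau)\}\tau\,d\theta$ with $\int(\rho' - \mu')^2\tau\,d\theta$.

```python
def midpoint(f, a, b, n=20000):
    """Composite midpoint rule; it never evaluates f at the endpoints a and b."""
    w = (b - a) / n
    return w * sum(f(a + (j + 0.5) * w) for j in range(n))


# Illustrative compactly supported prior tau(t) = (35/32)(1 - t^2)^3 on (-1, 1),
# i.e. the density of 2V - 1 with V ~ beta(4, 4); mu = log tau.
tau = lambda t: 35.0 / 32.0 * (1 - t * t) ** 3
mu1 = lambda t: -6 * t / (1 - t * t)                  # mu'
mu2 = lambda t: -6 * (1 + t * t) / (1 - t * t) ** 2   # mu''

# Illustrative smooth prior pi with rho = log pi = c t - b t^2 / 2 (hypothetical b, c).
b, c = 1.0, 0.5
rho1 = lambda t: c - b * t   # rho'
rho2 = lambda t: -b          # rho''

# With i(theta) = 1, (5.2) reduces to A(theta, pi) = (rho')^2 + 2 rho''.
A_pi = lambda t: rho1(t) ** 2 + 2 * rho2(t)
A_tau = lambda t: mu1(t) ** 2 + 2 * mu2(t)

# Left side: L(tau, pi) - L(tau, tau); right side: the quadratic form in (5.14).
lhs = midpoint(lambda t: (A_pi(t) - A_tau(t)) * tau(t), -1.0, 1.0)
rhs = midpoint(lambda t: (rho1(t) - mu1(t)) ** 2 * tau(t), -1.0, 1.0)
print(lhs, rhs)
```

The two integrals agree because the boundary term $\beta(\tau, \pi)$ vanishes for this $\tau$, which is the step used in the proof.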

The expression (5.15) for the predictive information $\zeta(\tau)$ is seen to be
invariant under reparameterization, as expected. It might appear at first
sight that $\zeta(\tau)$ will attain the value zero at $\tau = \pi^J$, but this is not necessarily
the case since $\pi^J$ may be improper and there may be no sequence of priors
in $\Gamma$ converging to $\pi^J$ in the right way: see the next section. Finally, note
that the form of $d(\tau, \pi)$ in Lemma 5.5 implies that $L(\theta, \pi)$ is a $\Gamma$-strictly
proper scoring rule since $d(\tau, \pi)$ attains its minimum value of zero uniquely
at $\pi = \tau \in \Gamma$.

6. Impartial, minimax and maximin priors. As expected, for a given
prior density $\pi \in \Pi_\infty$, from (4.10) the posterior predictive regret will be
large when the predictive information (4.9) in $\tau$ is large. Therefore it is not
possible to achieve constant regret over all possible $\tau \in \Phi$, nor minimaxity,
since the regret is unbounded. Instead, as discussed in Section 2, we consider
the predictive regret associated with using $\pi$ compared to using Jeffreys'
prior and study the behavior of the predictive loss function

(6.1)  $L(\tau, \pi) = d(\tau, \pi) - d(\tau, \pi^J)$,

which is the asymptotic form of the normalized version of equation (2.4).

Adopting standard game-theoretic terminology, the prior $\pi \in \Pi_\infty$ is an
equalizer prior if the predictive loss $L(\theta, \pi)$ is constant over $\theta \in \Theta$. This is
equivalent to the predictive loss (6.1) being constant over all $\tau \in \Gamma$. We will
therefore refer to an equalizer prior as an impartial prior. The prior $\pi_0 \in \Pi_\infty$
is minimax if $\sup_{\tau \in \Phi} L(\tau, \pi_0) = \overline{W}$, where

$\overline{W} = \inf_{\pi \in \Pi_\infty}\sup_{\tau \in \Phi} L(\tau, \pi)$

is the upper value of the game. To obtain minimax solutions, we will adopt a
standard game theory technique of searching for equalizer rules and showing
that they are "extended Bayes" rules; see, for example, Chapter 5 of [4]. This
is also the strategy used by Liang and Barron [19] for deriving minimax
priors under the predictive regret (2.2) for location and scale families. In the
present context the relevant result is given as Theorem 6.1 below.

Let $\Phi^+ \subset \Pi_\infty$ be the class of priors $\pi$ in $\Pi_\infty$ for which there exists a
sequence $(\tau_k)$ of priors in $\Phi$ satisfying (i) $L(\tau_k, \pi) = \int L(\theta, \pi)\,d\tau_k(\theta)$ and
(ii) $d(\tau_k, \pi) \to 0$. Since $L(\tau, \pi)$ is a proper scoring rule, each $\tau_k$ is a Bayes
solution and, hence, $\Phi^+$ can be regarded as a class of extended Bayes solutions.
If $\pi \in \Phi^+$ is an equalizer prior, then we can unambiguously define its
predictive information as

$\zeta(\pi) = \lim_{k \to \infty}\zeta(\tau_k)$

for any sequence $(\tau_k)$ in $\Phi$ satisfying (i) and (ii) above. This is true since
$L(\theta, \pi) = c$, say, for all $\theta \in \Theta$, and so for every such sequence we have
$L(\tau_k, \pi) = c$ for all $k$ from (i). Therefore, from (4.10),

(6.2)  $\zeta(\tau_k) = d(\tau_k, \pi) - c$,

which tends to $-c$ as $k \to \infty$.

Finally, we define the class $\mathcal{U} \subset \Pi_\infty$ of priors $\pi$ for which

(6.3)  $\limsup_{n \to \infty} c_n \sup_{\theta \in \Theta} L_{Y|X}(\theta, \pi) < \infty$

for every sequence $(m_n)$. Clearly, priors in $\mathcal{U}^c$ have poor finite sample predictive
behavior relative to Jeffreys' prior.

Lemma 6.1. Suppose that $\pi \in \mathcal{C} \cap \mathcal{U}$, that R1, R2 or R3 holds and that
$(m_n)$ is any sequence satisfying the conditions in Theorem 4.1(a), (b) or
(c), respectively. Then

$\sup_{\tau \in \Phi} L(\tau, \pi) \le \sup_{\theta \in \Theta} L(\theta, \pi)$.

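Proof. Let $\tau \in \Phi$, $\varepsilon > 0$ and choose a compact set $K \subset \Theta$ for which
$\int_{K^c} d\tau(\theta) < \varepsilon$. Then

$L_{Y|X}(\tau, \pi) \le \sup_{\theta \in K} L_{Y|X}(\theta, \pi) + \varepsilon \sup_{\theta \in \Theta} L_{Y|X}(\theta, \pi)$

so that

$L(\tau, \pi) = \limsup_{n \to \infty} c_n L_{Y|X}(\tau, \pi) \le \sup_{\theta \in K} L(\theta, \pi) + k\varepsilon$

from (4.6) since $\pi \in \mathcal{C}$, where $k = \limsup_{n \to \infty} c_n \sup_\theta L_{Y|X}(\theta, \pi) < \infty$ since
$\pi \in \mathcal{U}$. The result follows since $\varepsilon$ was arbitrary. □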

We now establish the following connection between equalizer and minimax 
priors. 

Theorem 6.1. Suppose that $\pi_0 \in \Phi^+ \cap \mathcal{C} \cap \mathcal{U}$ is an equalizer prior, that
R1, R2 or R3 holds with $\pi = \pi_0$ and that $(m_n)$ is any sequence satisfying the
conditions in Theorem 4.1(a), (b) or (c), respectively. Then $\pi_0$ is minimax
and $\zeta(\pi_0) = \inf_{\tau \in \Phi}\zeta(\tau)$.

Proof. Define

$\underline{W} = \sup_{\tau \in \Phi}\inf_{\pi \in \Pi_\infty} L(\tau, \pi)$

to be the lower value of the game. Then $\underline{W} \le \overline{W}$ is a standard result from
game theory. Next, since $\pi_0$ is an equalizer prior, we have $L(\theta, \pi_0) = c$, say,
for all $\theta \in \Theta$. Therefore, $\overline{W} = \inf_{\pi \in \Pi_\infty}\sup_{\tau \in \Phi} L(\tau, \pi) \le \sup_{\tau \in \Phi} L(\tau, \pi_0) \le
\sup_{\theta \in \Theta} L(\theta, \pi_0) = c$ from Lemma 6.1 since $\pi_0 \in \mathcal{C} \cap \mathcal{U}$. Therefore, $\overline{W} \le c$.

Since from Lemma 4.1 $L(\tau, \pi)$ is a $\Phi$-proper scoring rule, we have
$\inf_{\pi \in \Pi_\infty} L(\tau, \pi) \ge L(\tau, \tau) = -\zeta(\tau)$ for every $\tau \in \Phi$. Therefore, $\underline{W} \ge
-\inf_{\tau \in \Phi}\zeta(\tau)$. Since $\pi_0 \in \Phi^+$, there exists a sequence $(\tau_k)$ in $\Phi$ with $d(\tau_k, \pi_0) \to 0$.
Therefore, since $\zeta(\tau_k) \ge \inf_{\tau \in \Phi}\zeta(\tau) \ge -\underline{W}$ and, from (6.2), $\zeta(\tau_k) \to -c$ as
$k \to \infty$, we have $c \le \underline{W}$. These relations give $\overline{W} \le c \le \underline{W}$ and it follows that
$\overline{W} = c = \underline{W}$. The result now follows from the definitions of minimaxity and
$\zeta(\pi_0)$. □

We see that, under the conditions of Theorem 6.1, the minimax prior $\pi_0$
has a natural interpretation of containing minimum predictive information
about $Y$, since the infimum of the predictive information (4.9) is attained
at $\tau = \pi_0$. Equivalently, $\pi_0$ is maximin since it maximizes the Bayes risk
$-\zeta(\tau)$ of $\tau \in \Phi$ under (4.8) and, hence, is a least favorable prior under predictive
loss. Notice also that Theorem 6.1 implies that $\sup_{\tau \in \Phi} L(\tau, \pi_0) = c$,
regardless of the particular sequence $(m_n)$ used.

We note that for the assertion of Theorem 6.1 to hold we require that $\pi_0$
satisfies condition (6.3). There may exist a prior $\pi_1 \in \mathcal{U}^c$ which appears to
dominate the minimax prior $\pi_0$ on the basis of the asymptotic predictive loss
function $L(\theta, \pi)$. However, this prior will possess poor penultimate asymptotic
behavior since $L_{Y|X}(\theta, \pi_1)$ will be asymptotically unbounded. This will
be reflected in the value of $\sup_{\tau \in \Phi} L(\tau, \pi_1)$, which will necessarily be greater
than $\sup_{\theta \in \Theta} L(\theta, \pi_1)$. This phenomenon will be illustrated in Example 6.1.

Corollary 6.1. Assume the conditions of Theorem 6.1 and additionally
that $\pi_0$ is proper. If $\zeta(\pi_0) = -c$, where $c$ is the constant value of
$L(\theta, \pi_0)$, then $\pi_0$ is minimax and $\zeta(\pi_0) = \inf_{\tau \in \Phi}\zeta(\tau)$.

Proof. Since $d(\pi_0, \pi_0) = 0$ and $\int L(\theta, \pi_0)\,d\pi_0(\theta) = c = -\zeta(\pi_0) =
L(\pi_0, \pi_0)$, it follows on taking $\tau_k = \pi_0$ that $\pi_0 \in \Phi^+$. The result now follows
from Theorem 6.1. □

Suppose that $\pi_0 \in \mathcal{C} \cap \mathcal{U}$ is an improper equalizer prior. One way to
show that $\pi_0 \in \Phi^+$ is to construct a sequence $(\tau_k)$ of priors in $\Gamma$ for which
$d(\tau_k, \pi_0) \to 0$, where $d(\tau, \pi_0)$ is given by formula (5.14). As noted just prior
to Lemma 5.5, the condition $L(\tau_k, \pi_0) = \int L(\theta, \pi_0)\,d\tau_k(\theta)$ is automatically
satisfied when $\tau_k \in \Gamma$.

We consider first the case $p = 1$. In this case it turns out that Jeffreys'
prior is a minimax solution, and, hence, the assertion at the end of Example
5.1. Let $\mathcal{H}$ be the class of probability density functions $h$ on $(-1, 1)$ possessing
second-order continuous derivatives and that satisfy $h(-1) = h'(-1) =
h''(-1) = h(1) = h'(1) = h''(1) = 0$ and

(6.4)  $\int_{-1}^{1}\{g'(u)\}^2 h(u)\,du < \infty$,

where $g(u) = \log h(u)$; that is, the Fisher information associated with $h$ is
finite. The class $\mathcal{H}$ is nonempty, since the density of the random variable
$U = 2V - 1$, where $V$ has any beta$(a, b)$ distribution with $a, b > 3$, satisfies these
conditions.

Corollary 6.2. Suppose that $p = 1$. Then Jeffreys' prior is minimax
and $\zeta(\pi^J) = \inf_{\tau \in \Phi}\zeta(\tau)$.

Proof. Since $L(\theta, \pi^J) = 0$, Jeffreys' prior is an equalizer prior. We
therefore need to show that $\pi^J \in \Phi^+ \cap \mathcal{C} \cap \mathcal{U}$. Recall that $\pi^J \in \mathcal{C}$ was an
assumption made in Section 4. Also, since $L_{Y|X}(\theta, \pi^J) = 0$ for all $n$ from
(2.5), $\pi^J \in \mathcal{U}$.

If $\pi^J$ is proper, the result now follows immediately from Corollary 6.1 since
$\zeta_{Y|X}(\pi^J) = 0$ for all $n$. Suppose then that $\pi^J$ is improper. Without loss of
generality, we assume that $i(\theta) = 1$, so that Jeffreys' prior is uniform. Since
$\pi^J$ is improper, without loss of generality we take $\Theta$ to be either $(-\infty, \infty)$ or
$(0, \infty)$ by a suitable linear transformation. Now let $U$ be a random variable
with density $h \in \mathcal{H}$.

Suppose first that $\Theta = (-\infty, \infty)$ and let $\tau_k$ be the density of $\theta = kU$.
Clearly, $\tau_k \in \Gamma$, has support $[-k, k]$ and $\mu_k'(\theta) = g'(u)/k$, where $\mu_k = \log \tau_k$
and $u = \theta/k$. Therefore, from (5.14),

$d(\tau_k, \pi^J) = k^{-2}E\{g'(U)\}^2 \to 0$

as $k \to \infty$ from (6.4) so that $\pi^J \in \Phi^+$. The result now follows from Theorem
6.1.

Next suppose that $\Theta = (0, \infty)$ and let $\tau_k$ be the density of $\theta = k(U + 1) + 1$.
Then $\tau_k \in \Gamma$, $\tau_k$ has support $[1, 2k + 1]$ and $\mu_k'(\theta) = g'(u)/k$, where $u =
(\theta - 1)/k - 1$. Therefore, from (5.14),

$d(\tau_k, \pi^J) = k^{-2}E\{g'(U)\}^2 \to 0$

as $k \to \infty$ from (6.4), so that $\pi^J \in \Phi^+$ and again the result follows from
Theorem 6.1. □
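
The approximating sequence used in this proof can be made concrete. The sketch below is an illustration under stated assumptions rather than the authors' computation: it takes $h$ to be the density of $U = 2V - 1$ with $V \sim$ beta(4, 4), approximates the Fisher information $E\{g'(U)\}^2$ in (6.4) by a midpoint rule, and evaluates $d(\tau_k, \pi^J) = k^{-2}E\{g'(U)\}^2$ for $\theta = kU$. The function names and the grid size `n` are hypothetical, and the closed form in `fisher_closed_form` was derived for this note from beta moments and does not come from the paper.

```python
import math


def fisher_information(a, b, n=20000):
    """Approximate E{g'(U)}^2 for U = 2V - 1, V ~ beta(a, b), by the midpoint rule.

    g = log h, where h(u) = (1+u)^(a-1) (1-u)^(b-1) / (2^(a+b-1) B(a, b)) on (-1, 1).
    The integral is finite for a, b > 2; the class in the text asks for a, b > 3.
    """
    log_norm = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                - (a + b - 1) * math.log(2))
    w = 2.0 / n
    total = 0.0
    for j in range(n):
        u = -1.0 + (j + 0.5) * w
        h = math.exp(log_norm + (a - 1) * math.log1p(u) + (b - 1) * math.log1p(-u))
        g1 = (a - 1) / (1 + u) - (b - 1) / (1 - u)   # g'(u)
        total += g1 * g1 * h
    return total * w


def fisher_closed_form(a, b):
    """Closed form obtained from beta moments (derived for this note, a, b > 2)."""
    return 0.25 * (a + b - 1) * (a + b - 2) * (1 / (a - 2) + 1 / (b - 2))


def regret(k, a=4.0, b=4.0):
    """d(tau_k, pi^J) = k^{-2} E{g'(U)}^2 for theta = k U when i(theta) = 1."""
    return fisher_information(a, b) / k**2


for k in (1, 10, 100):
    print(k, regret(k))
```

The regret decays at rate $k^{-2}$, so any $h$ with finite Fisher information gives a sequence of compactly supported priors along which $d(\tau_k, \pi^J) \to 0$.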

Example 6.1. Bernoulli model. Here Jeffreys' prior is the beta(1/2, 1/2)
distribution, which is therefore minimax from Corollary 6.2. The underlying
Bernoulli probability mass function is $f(x|\theta) = \theta^x(1 - \theta)^{1-x}$, $x = 0, 1$, $0 < \theta < 1$.
Let $\pi^a$ be the density of the beta$(a, a)$ distribution, where $a > 0$. It is
straightforward to check from (5.12) that

$L(\theta, \pi^a) = (a - \tfrac{1}{2})\left\{\dfrac{a - \tfrac{3}{2}}{\theta(1 - \theta)} - 4(a - \tfrac{1}{2})\right\}$,

from which we see that $L(\theta, \pi_1) = -4$, where $\pi_1 = \pi^{3/2}$, the beta(3/2, 3/2)
distribution. Hence, the prior $\pi_1$ would appear to dominate Jeffreys' prior.
In view of Corollary 6.2, however, we conclude that condition (6.3) must
break down for this prior. Indeed, it can be shown directly that Cniy|x(0) ^i) 
is an increasing function of m for fixed n and that, when m = 1, we have 
Cn-Zvy|x(0) ■^i) = ^^ + 0(l). By the continuity of Ly\x{&tT^i) in (0) !)> it follows 
that $c_n \sup_\theta L_{Y|X}(\theta, \pi_1) \to \infty$ as $n \to \infty$ for every sequence $(m_n)$ and so
$\pi_1 \notin \mathcal{U}$. Therefore, $\pi_1$ exhibits poor finite sample predictive behavior relative
to Jeffreys' prior for values of $\theta$ close to 0 or 1.
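
The asymptotic loss in this example can be checked numerically. The sketch below assumes only (5.2) and (5.12) with $p = 1$ and $i(\theta) = 1/\{\theta(1 - \theta)\}$, and evaluates $L(\theta, \pi^a) = A(\theta, \pi^a) - A(\theta, \pi^J)$ by central finite differences. The function names and the step size `h` are hypothetical, and `loss_closed_form` is a closed form derived for this note from (5.12), included as a cross-check.

```python
import math


def A_bernoulli(log_prior, theta, h=1e-4):
    """A(theta, pi) = i^{-1} (rho')^2 + 2 d/dtheta (i^{-1} rho'), with i^{-1} = theta (1 - theta)."""
    def rho1(t):
        # central-difference approximation to rho'(t)
        return (log_prior(t + h) - log_prior(t - h)) / (2 * h)

    def flux(t):
        # the quantity i^{-1}(t) rho'(t) that is differentiated in (5.2)
        return t * (1 - t) * rho1(t)

    quadratic = theta * (1 - theta) * rho1(theta) ** 2
    divergence = (flux(theta + h) - flux(theta - h)) / (2 * h)
    return quadratic + 2 * divergence


def beta_log_prior(a):
    """Log density of beta(a, a), up to an additive constant."""
    return lambda t: (a - 1) * (math.log(t) + math.log(1 - t))


def loss(a, theta):
    """L(theta, pi^a) relative to Jeffreys' beta(1/2, 1/2) prior, as in (5.12)."""
    return A_bernoulli(beta_log_prior(a), theta) - A_bernoulli(beta_log_prior(0.5), theta)


def loss_closed_form(a, theta):
    """Closed form derived for this note from (5.12); not quoted from the paper."""
    return (a - 0.5) * ((a - 1.5) / (theta * (1 - theta)) - 4 * (a - 0.5))


for theta in (0.2, 0.5, 0.8):
    print(theta, loss(1.5, theta), loss(1.0, theta))
```

For $a = 3/2$ the loss is the constant $-4$, whereas for other values such as $a = 1$ it varies with $\theta$, so within this family only $a = 1/2$ and $a = 3/2$ give equalizer priors.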

It is of some interest to compare this behavior with the asymptotic min- 
imax analysis under the prior predictive regret (4.1). Under (4.1), Jeffreys' 
prior is asymptotically maximin [8], but not minimax due to its poor bound- 
ary risk behavior. However, a sequence of priors converging to Jeffreys' prior 
can be constructed that is asymptotically minimax [26]. Under our posterior 
predictive regret criterion, Jeffreys' prior is both maximin and minimax. In 
particular, it follows that it is not possible to modify the beta(3/2, 3/2) distribution
at the boundaries to make it asymptotically minimax.

In the examples below our strategy for identifying a minimax prior will be
to consider a suitable class of candidate priors in $\mathcal{C}$, compute the predictive
loss (5.12), identify the subclass of equalizer priors in $\mathcal{U}$ and choose the
prior $\pi_0$ in this subclass, assuming it is nonempty, with minimum constant
loss. Clearly, $\pi_0$ will be minimax over this subclass of equalizer priors. If, in
addition, it can be shown that $\pi_0 \in \Phi^+$, then the conditions of Theorem 6.1
hold and $\pi_0$ is minimax over $\Phi$. In particular, we will see that in dimensions
greater than one, although Jeffreys' prior is necessarily impartial, it may not
be minimax. This is not surprising, since we know that in the special case
of transformation models the right Haar measure is the best invariant prior
under posterior predictive loss (see Section 2). Exact minimax solutions for
Examples 6.2 and 6.3 under the predictive regret (2.2) have recently been
obtained by Liang and Barron [19]. Finally, all these examples are sufficiently
regular for the strongest form R3 of remainder to hold for the priors $\pi_0$ that
are obtained. Hence, from Theorem 4.1(c), all the results will apply for an
arbitrary amount of prediction.

Example 6.2. Normal model with unknown mean and variance. Here
$X_i \sim N(\beta, \sigma^2)$ and $\theta = (\beta, \sigma)$. We will show that the prior $\pi_0(\theta) \propto \sigma^{-1}$ is
minimax. This is Jeffreys' independence prior, or the right Haar measure
under the group of affine transformations of the data.

Consider the class of improper priors $\pi^\alpha(\theta) \propto \sigma^{-\alpha}$ on $\Theta$, where $\alpha \in \mathcal{R}$.
Transforming to $\phi = (\beta, \lambda)$, where $\lambda = \log \sigma$, these priors become $\pi^\alpha(\phi) \propto
\exp\{-(\alpha - 1)\lambda\}$ in the $\phi$-parameterization. Here we find that $i(\phi) =
\mathrm{diag}(e^{-2\lambda}, 2)$. Since $\rho^\alpha(\phi) = \log \pi^\alpha(\phi) = -(\alpha - 1)\lambda$, it follows immediately
from (5.2) that $A(\phi, \pi^\alpha) = \tfrac{1}{2}(\alpha - 1)^2$. Furthermore, since $|i(\phi)| = 2e^{-2\lambda}$, we
have $\pi^J(\phi) \propto e^{-\lambda} = \pi^2(\phi)$ so that $A(\phi, \pi^J) = \tfrac{1}{2}$. It now follows from (5.12)
that $L(\phi, \pi^\alpha) = \tfrac{1}{2}\{(\alpha - 1)^2 - 1\}$. Therefore, all priors in this class are equalizer
priors and $L(\phi, \pi^\alpha)$ attains its minimum value in this class when $\alpha = 1$,
which corresponds to $\pi_0(\phi) \propto 1$, or $\pi_0(\theta) \propto \sigma^{-1}$ in the $\theta$-parameterization.
Note that the minimum value is $-\tfrac{1}{2} < 0$, where 0 is the loss under Jeffreys' prior.

We now show that $\pi_0 \in \Phi^+ \cap \mathcal{C} \cap \mathcal{U}$. Clearly, $\pi_0 \in \mathcal{C}$, while $\pi_0 \in \mathcal{U}$ follows
because $L_{Y|X}(\theta, \pi_0)$ is constant for all $n$ since $\pi_0$ is invariant under the
transitive group of transformations of $\Theta$ induced by the group of affine
transformations of the observations (see Section 2). It remains to show that
$\pi_0 \in \Phi^+$. Let $U_1, U_2$ be independent random variables with common density
$h \in \mathcal{H}$ and let $\tau_k$ be the joint density of $\phi = (\beta, \lambda)$, where $\beta = k_1U_1$, $\lambda = k_2U_2$
and $k_1, k_2$ are functions of $k$ to be determined. Let $\mu_k = \log \tau_k$. Then $\mu_{kr} =
k_r^{-1}g'(U_r)$, $r = 1, 2$, where $g = \log h$. Write $a = \int\{g'(u)\}^2 h(u)\,du < \infty$ since
$h \in \mathcal{H}$. Since $\rho_0(\phi) = \log \pi_0(\phi)$ is constant, it follows from (5.14) that

$d(\tau_k, \pi_0) = E[k_1^{-2}e^{2\lambda}\{g'(U_1)\}^2 + \tfrac{1}{2}k_2^{-2}\{g'(U_2)\}^2] \le a\{k_1^{-2}e^{2k_2} + \tfrac{1}{2}k_2^{-2}\}$,

since $\lambda \le k_2$. Now take $k_1 = ke^k$, $k_2 = k$. Then $d(\tau_k, \pi_0) \le 3a/(2k^2) \to 0$ as $k \to \infty$
and, hence, $\pi_0 \in \Phi^+$. It now follows from Theorem 6.1 that $\pi_0$ is minimax
and that $\zeta(\pi_0) = \tfrac{1}{2}$.

Example 6.3. Normal linear regression. Here $X_i \sim N(z_i^T\beta, \sigma^2)$,
$i = 1, \ldots, n$, where $Z_n = (z_1, \ldots, z_n)^T$ is an $n \times q$ matrix of rank $q \ge 1$
and $\theta = (\beta, \sigma)$. Using a similar argument to that in Example 6.2, we can
show that again Jeffreys' independence prior, or the right Haar measure,
$\pi_0(\theta) \propto \sigma^{-1}$ is minimax.

Since the variables are not identically distributed in this example, it is
not covered by the asymptotic theory of Sections 4 and 5. However, under
suitable stability assumptions on the sequence $(z_i)$ of regressor variables, at
least that $V_n \equiv n^{-1}Z_n^TZ_n$ is uniformly bounded away from zero and infinity,
a version of Theorem 5.1 will apply.

Proceeding as in Example 6.2, we again consider the class of priors $\pi^\alpha(\theta) \propto
\sigma^{-\alpha}$ on $\Theta$, where $\alpha \in \mathcal{R}$. Transforming to $\phi = (\beta, \lambda)$, where $\lambda = \log \sigma$,
these priors become $\pi^\alpha(\phi) \propto \exp\{-(\alpha - 1)\lambda\}$. Here we find that $i_n(\phi) =
\mathrm{diag}(e^{-2\lambda}V_n, 2)$ and, exactly as in Example 6.2, we obtain $A(\phi, \pi^\alpha) = \tfrac{1}{2}(\alpha -
1)^2$. Here $|i_n(\phi)| = 2|V_n|e^{-2q\lambda}$ so $\pi^J(\phi) \propto e^{-q\lambda} = \pi^{q+1}(\phi)$ for all $n$, giving
$A(\phi, \pi^J) = \tfrac{1}{2}q^2$ and, hence, $L(\phi, \pi^\alpha) = \tfrac{1}{2}\{(\alpha - 1)^2 - q^2\}$. Therefore, all priors
in this class are equalizer priors and $L$ attains its minimum value in this
class when $\alpha = 1$, which corresponds to $\pi_0(\phi) \propto 1$, or $\pi_0(\theta) \propto \sigma^{-1}$ in the
$\theta$-parameterization. Notice that the drop in predictive loss increases as the
square of the number $q$ of regressors in the model. Note also that the ratio
$|i_n|^{-1}|i_{n+1}|$ is free from $\theta$, so that a version of Theorem 4.1 will hold.
Exactly as in Example 6.2, $\pi_0 \in \mathcal{C} \cap \mathcal{U}$ and it remains to show that $\pi_0 \in \Phi^+$.
Let $p = q + 1$ and $U_j$, $j = 1, \ldots, p$, be independent random variables with
common density $h \in \mathcal{H}$. With the same definitions as in Example 6.2, let
$\beta_r = k_1U_r$, $r = 1, \ldots, q$, $\lambda = k_2U_p$, so that $\mu_{kr} = k_1^{-1}g'(U_r)$, $r = 1, \ldots, q$, $\mu_{kp} =
k_2^{-1}g'(U_p)$. Then it follows from (5.14) that, with the summations over $r$ and
$s$ running from 1 to $q$,

$d(\tau_k, \pi_0) = E\{k_1^{-2}e^{2\lambda}V_n^{rs}g'(U_r)g'(U_s) + \tfrac{1}{2}k_2^{-2}g'(U_p)^2\}$
$\le a\{k_1^{-2}e^{2k_2}\,\mathrm{trace}(V_n^{-1}) + \tfrac{1}{2}k_2^{-2}\}$,

using $\int g'(u)h(u)\,du = 0$. Now take $k_1 = ke^k$, $k_2 = k$. Then, as before,
$d(\tau_k, \pi_0) \to 0$ as $k \to \infty$ and, hence, $\pi_0 \in \Phi^+$. It follows from Theorem 6.1 that $\pi_0$
is minimax and $\zeta(\pi_0) = \tfrac{1}{2}q^2$.
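
The calculations in Examples 6.2 and 6.3 can be reproduced numerically. The following sketch assumes $V_n = I_q$ (an illustrative choice, not made in the text) and evaluates $A(\phi, \pi)$ from (5.2) by central finite differences in the $\phi = (\beta, \lambda)$ parameterization, then forms $L = A(\phi, \pi^\alpha) - A(\phi, \pi^J)$ as in (5.12). It also evaluates the upper bound on $d(\tau_k, \pi_0)$ for $k_1 = ke^k$, $k_2 = k$. All function names, the step size `h` and the default Fisher constant are hypothetical.

```python
import math


def grad(f, x, h=1e-4):
    """Central-difference gradient of a scalar function f at the point x (a list)."""
    g = []
    for j in range(len(x)):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        g.append((f(xp) - f(xm)) / (2 * h))
    return g


def A(log_prior, inv_info, x, h=1e-4):
    """A(theta, pi) = i^{rs} rho_r rho_s + 2 D_s(i^{rs} rho_r), as in (5.2)."""
    p = len(x)

    def flux(y):
        # vector field v^s = i^{rs} rho_r evaluated at y
        rho = grad(log_prior, y, h)
        iinv = inv_info(y)
        return [sum(iinv[r][s] * rho[r] for r in range(p)) for s in range(p)]

    rho = grad(log_prior, x, h)
    iinv = inv_info(x)
    quadratic = sum(iinv[r][s] * rho[r] * rho[s] for r in range(p) for s in range(p))
    divergence = 0.0
    for s in range(p):
        xp, xm = list(x), list(x)
        xp[s] += h
        xm[s] -= h
        divergence += (flux(xp)[s] - flux(xm)[s]) / (2 * h)
    return quadratic + 2 * divergence


def inverse_information(q):
    """Inverse of i(phi) = diag(exp(-2 lambda) I_q, 2), assuming V_n = I_q."""
    def inv_info(phi):
        lam = phi[-1]
        m = [[0.0] * (q + 1) for _ in range(q + 1)]
        for r in range(q):
            m[r][r] = math.exp(2 * lam)
        m[q][q] = 0.5
        return m
    return inv_info


def loss(alpha, q, phi):
    """L(phi, pi^alpha), with Jeffreys' prior equal to pi^{q+1} in this class."""
    log_prior = lambda a: (lambda y: -(a - 1) * y[-1])
    inv_info = inverse_information(q)
    return A(log_prior(alpha), inv_info, phi) - A(log_prior(q + 1), inv_info, phi)


def regret_upper_bound(k, q, fisher=10.5):
    """a {k1^{-2} exp(2 k2) trace(V^{-1}) + k2^{-2} / 2} with k1 = k e^k, k2 = k, V = I_q."""
    log_k1 = math.log(k) + k
    return fisher * (q * math.exp(2 * k - 2 * log_k1) + 0.5 / k**2)


print(loss(1.0, 1, [0.3, -0.2]), loss(1.0, 3, [0.1, 0.2, 0.3, 0.4]))
```

With $q = 1$ this recovers the loss $-\tfrac{1}{2}$ of Example 6.2 at $\alpha = 1$, and for general $q$ the loss at $\alpha = 1$ is $-\tfrac{1}{2}q^2$, while the bound on $d(\tau_k, \pi_0)$ equals $a(q + \tfrac{1}{2})/k^2$ under the assumption $V_n = I_q$.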

Interestingly, we note that the priors $\pi_0$ identified in Examples 6.2 and 6.3
also give rise to minimum predictive coverage probability bias; see [12]. The
next example is more challenging and illustrates the difficulties associated
with finding minimax priors more generally.

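Example 6.4. Multivariate normal. Here $X \sim N_q(\mu, \Sigma)$, with $\theta$ comprising
all elements of $\mu$ and $\Sigma$. Write $\Sigma^{-1} = T'T$, where $T = (t_{ij})$ is a lower triangular
matrix satisfying $t_{ii} > 0$. Let $\mu = (\mu_1, \ldots, \mu_q)'$, $\psi_i = t_{ii}$, $1 \le i \le q$, $\psi =
(\psi_1, \ldots, \psi_q)'$, $\beta_{ij} = t_{ii}^{-1}t_{ij}$, $1 \le j < i \le q$ and $\beta^{(i)} = (\beta_{i1}, \ldots, \beta_{i,i-1})'$, $2 \le i \le q$.
Then $\gamma = (\psi', \beta^{(2)\prime}, \ldots, \beta^{(q)\prime}, \mu')'$ is a one-to-one transformation of $\theta$. The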

loglikelihood is 

aft ^ 2n 



i=l 



i=l {j=l 



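writing $\beta_{ii} = 1$, $i = 1, \ldots, q$. One then finds that the information matrix $i(\gamma)$
is block diagonal in $\psi_1, \ldots, \psi_q, \beta^{(2)}, \ldots, \beta^{(q)}, \mu$ and is given by

$\mathrm{diag}(2\psi_1^{-2}, \ldots, 2\psi_q^{-2}, \psi_2^2\Sigma_{11}, \ldots, \psi_q^2\Sigma_{q-1,q-1}, \Sigma^{-1})$,

where $\Sigma_{ii}$ is the submatrix of $\Sigma$ corresponding to the first $i$ components
of $X$. Using the fact that $|\Sigma_{ii}| = \prod_{j=1}^{i}\psi_j^{-2}$, $i = 1, \ldots, q$, we obtain $|i(\gamma)| =
2^q\prod_{i=1}^{q}\psi_i^{2(2i-q-1)}$.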

Consider the class of priors $\pi^\alpha(\theta) \propto |\Sigma|^{-(q+2-\alpha)/2}$ on $\Theta$, where $\alpha \in \mathcal{R}$. In
the $\gamma$-parameterization, this class becomes $\pi^\alpha(\gamma) \propto \prod_{i=1}^{q}\psi_i^{2i-q-1-\alpha}$. Noting
that the case $\alpha = 0$ is Jeffreys' prior, it is straightforward to show from
(5.12) that $L(\gamma, \pi^\alpha) = \tfrac{q}{2}\{(\alpha - 1)^2 - 1\}$. Therefore, all priors in this class are
equalizer priors and $L$ attains its minimum value within this class when
$\alpha = 1$. From invariance considerations via affine transformations of $X$, it can
be shown that these priors are also equalizer priors for finite $n$ and, hence,
are all in the class $\mathcal{U}$. These results therefore suggest that the right Haar
prior $\pi_0(\theta) \propto |\Sigma|^{-(q+1)/2}$ arising from the affine group is minimax. However,
in this example it does not appear to be possible to approximate $\pi_0$ by
a sequence of compact priors, as was done in the previous examples. We
conjecture, however, that $\pi_0$ can be approximated by a suitable sequence of
proper priors so that Theorem 6.1 will give the minimaxity of $\pi_0$, but we
have been unable to demonstrate this. This example does show, however,
that Jeffreys' prior is dominated by $\pi_0$.

Interestingly, further analysis reveals that the prior $\pi_1(\gamma) \propto \prod_{i=1}^{q}\psi_i^{-1}$ is
also an equalizer prior and that it dominates $\pi_0$. In the $\theta$-parameterization
this prior becomes $\pi_1(\theta) \propto \{\prod_{i=1}^{q}|\Sigma_{ii}|\}^{-1}$. However, this prior is seen to be
noninvariant under nonsingular transformation of $X$ and, furthermore, does
not satisfy the boundedness condition (6.3).

In the case $q = 2$, in the parameterization $\phi = (\mu_1, \mu_2, \sigma_1, \sigma_2, \rho)$, where $\sigma_i$
is the standard deviation of $X_i$, $i = 1, 2$, and $\rho = \mathrm{Corr}(X_1, X_2)$, Jeffreys' prior
and $\pi_0$ become, respectively,

$\pi^J(\phi) \propto \sigma_1^{-2}\sigma_2^{-2}(1 - \rho^2)^{-2}$,

$\pi_0(\phi) \propto \sigma_1^{-1}\sigma_2^{-1}(1 - \rho^2)^{-3/2}$.

Therefore (see the paragraph below), $\pi_0$ is Jeffreys' "two-step" prior. In the
context of our predictive set-up, marginalization issues correspond to predicting
only certain functions of the future data $Y = (X_{n+1}, \ldots, X_{n+m})$. In
general, the associated minimax predictive prior will differ from that for the
problem of predicting the entire future data $Y$ unless the selected statistics
just form a sufficiency reduction of $Y$. Such questions will be explored
in future work. Thus, if we were only interested in predicting the correlation
coefficient of a future set of bivariate data, then we might start with
the observed correlation as the data $X$ and use Jeffreys' prior in this single
parameter case, which is $\pi(\rho) \propto (1 - \rho^2)^{-1}$. For further discussion and
references on the choice of prior in this example, see [6], page 363.

Finally, we note the corresponding result for general $q$ in the case $\mu$ known.
Again, considering the class of priors $\pi^\alpha(\theta) \propto |\Sigma|^{-(q+2-\alpha)/2}$ on $\Theta$, we find
that the optimal choice is $\alpha = 1$, so $\pi_0$ is as given above and in this case coincides
with Jeffreys' prior. This was also shown to be a predictive probability
matching prior in [12] in the case $q = 2$.

Under the conditions of Theorem 6.1, it is possible to change the base
measure from Jeffreys' prior to $\pi_0$, since $\pi_0$ is neutral with respect to $\pi^J$
under $L(\theta, \pi)$. Denoting quantities with respect to the base measure $\pi_0$ with
a zero subscript, since $L(\theta, \pi_0) = c \le 0$ and $\zeta(\pi_0) = -c$, we have, for $\pi \in \Pi_\infty$,

$L_0(\theta, \pi) = L(\theta, \pi) - L(\theta, \pi_0) = L(\theta, \pi) - c$

and for $\tau \in \Phi$

$\zeta_0(\tau) = \zeta(\tau) + c$.

Therefore, with respect to the base measure $\pi_0$, the predictive loss under $\pi_0$
becomes $L_0(\theta, \pi_0) = 0$ and the minimum predictive information, attained at
$\pi = \pi_0$, is zero.

7. Discussion. In this paper we have obtained an asymptotic predic- 
tive loss function that reflects the finite sample size predictive behavior of 
alternative priors when the sample size is large for arbitrary amounts of 
prediction. This loss function is related to that in [14] for the comparison 
of estimative predictive distributions based on Bayes estimators. It can be 
used to derive nonsubjective priors that are impartial, minimax and max- 
imin, which is equivalent here to minimizing a measure of the predictive 
information contained in a prior. In dimensions greater than one, unlike an 
analysis based on prior predictive regret, the maximin prior may not be 
Jeffreys' prior. A number of examples have been given to illustrate these 
ideas. 

As discussed in [23], as model complexity increases, it becomes more dif- 
ficult to make sensible prior assignments, while at the same time the effect 
of the prior specification on the final inference of interest becomes more pro- 
nounced. It is therefore important to have sound methodology available for 
the construction and implementation of priors in the multiparameter case. 
We believe that our preliminary analysis of the posterior predictive regret 
(2.1) indicates that it should be a valuable tool for such an enterprise. More 
extensive analysis is now required, particularly aimed at developing gen- 
eral methods of finding exact and approximate solutions for the practical 
implementation of this work and investigating connections with predictive 
coverage probability bias. Local priors (see, e.g., [23, 24]) are expected to 
play a role. It would also be interesting to develop asymptotically impartial 
minimax posterior predictive loss priors for dependent observations and for 
various classes of nonregular problems. In particular, all the definitions in 
Section 2 for nonasymptotic settings will apply and could be used to explore 
predictive behavior numerically. 